From f9221879f0bb699292b94ba2d94bd6702cd67f2d Mon Sep 17 00:00:00 2001 From: Chaoyu Date: Wed, 24 Apr 2024 13:23:24 -0700 Subject: [PATCH] docs: Update README.md (#964) * Update README.md Signed-off-by: Chaoyu * Update README.md Co-authored-by: Sherlock Xu <65327072+Sherlock113@users.noreply.github.com> Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com> --------- Signed-off-by: Chaoyu Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com> Co-authored-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com> Co-authored-by: Sherlock Xu <65327072+Sherlock113@users.noreply.github.com> --- README.md | 161 +++++++----------------------------------------------- 1 file changed, 19 insertions(+), 142 deletions(-) diff --git a/README.md b/README.md index 62160ab0..dc3117f2 100644 --- a/README.md +++ b/README.md @@ -3,61 +3,32 @@
-

🦾 OpenLLM: Self-Hosting Large Language Models Made Easy

+

🦾 OpenLLM: Self-Hosting LLMs Made Easy

pypi_status test_pypi_status - - Twitter - - Discord ci pre-commit.ci status -
- - python_version - - Hatch - - code style - - Ruff - - types - mypy - - types - pyright -
-

Run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints, locally and in the cloud.

- +
+ Twitter + + Discord +
## 📖 Introduction -OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy them on the cloud or on-premises, and build powerful AI applications. +OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment. -Key features include: -🚂 **State-of-the-art LLMs**: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. +- 🚂 Support for a wide range of open-source LLMs, including LLMs fine-tuned with your own data +- ⛓️ OpenAI-compatible API endpoints for a seamless transition from your LLM app to open-source LLMs +- 🔥 State-of-the-art serving and inference performance +- 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com) -🔥 **Flexible APIs**: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice. - -⛓️ **Freedom to build**: First-class support for LangChain, BentoML, LlamaIndex, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services. - -🎯 **Streamlined deployment**: Automatically generate your LLM server Docker images or deploy as serverless endpoints via -[☁️ BentoCloud](https://l.bentoml.com/bento-cloud), which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness. - -🤖 **Bring your own LLM**: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance on specific tasks. 
A unified fine-tuning API for models (`LLM.tuning()`) is coming soon. - -⚡ **Quantization**: Run inference with less computational and memory cost using quantization techniques such as [LLM.int8](https://arxiv.org/abs/2208.07339), [SpQR (int4)](https://arxiv.org/abs/2306.03078), [AWQ](https://arxiv.org/pdf/2306.00978.pdf), [GPTQ](https://arxiv.org/abs/2210.17323), and [SqueezeLLM](https://arxiv.org/pdf/2306.07629v2.pdf). - -📡 **Streaming**: Support token streaming through server-sent events (SSE). You can use the `/v1/generate_stream` endpoint for streaming responses from LLMs. - -🔄 **Continuous batching**: Support continuous batching via [vLLM](https://github.com/vllm-project/vllm) for increased total throughput. - -OpenLLM is designed for AI application developers working to build production-ready applications based on LLMs. It delivers a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring these models, simplifying the end-to-end deployment workflow for LLMs. @@ -70,6 +41,7 @@ OpenLLM is designed for AI application developers working to build production-re ## 💾 TL/DR For starters, we provide two ways to quickly try out OpenLLM: + ### Jupyter Notebooks Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb). @@ -93,6 +65,7 @@ docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b ## 🏃 Get started The following provides instructions for how to get started with OpenLLM locally. + ### Prerequisites You have installed Python 3.8 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts. @@ -133,7 +106,6 @@ Commands: prune Remove all saved models, (and optionally bentos) built with OpenLLM locally. query Query a LLM interactively, from a terminal. 
start Start a LLMServer for any supported LLM. - start-grpc Start a gRPC LLMServer for any supported LLM. Extensions: build-base-container Base image builder for BentoLLM. @@ -1121,49 +1093,11 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx . > [!IMPORTANT] > Fine-tuning support is still experimental and currently only works with the PyTorch backend. vLLM support is coming soon. -## 🐍 Python SDK - -Each LLM can be instantiated with `openllm.LLM`: - -```python -import openllm - -llm = openllm.LLM('microsoft/phi-2') -``` - -The main inference API is the streaming `generate_iterator` method: - -```python -async for generation in llm.generate_iterator('What is the meaning of life?'): - print(generation.outputs[0].text) -``` - -> [!NOTE] -> The motivation behind making `llm.generate_iterator` an async generator is to support continuous batching with the vLLM backend. With async endpoints, each prompt -> is added correctly to the request queue for processing by the vLLM backend. - -There is also a _one-shot_ `generate` method: - -```python -await llm.generate('What is the meaning of life?') -``` - -This method is convenient for one-shot generation, but it is merely an example of how to use `llm.generate_iterator`, as it uses `generate_iterator` under the hood. - -> [!IMPORTANT] -> If you need to call your code in a synchronous context, you can use `asyncio.run` to wrap an async function: -> -> ```python -> import asyncio -> async def generate(prompt, **attrs): return await llm.generate(prompt, **attrs) -> asyncio.run(generate("The meaning of life is", temperature=0.23)) -> ``` ## ⚙️ Integrations OpenLLM is not just a standalone product; it's a building block designed to integrate with other powerful tools easily. 
We currently offer integration with -[BentoML](https://github.com/bentoml/BentoML), [OpenAI's Compatible Endpoints](https://platform.openai.com/docs/api-reference/completions/object), [LlamaIndex](https://www.llamaindex.ai/), [LangChain](https://github.com/hwchase17/langchain), and @@ -1192,29 +1126,8 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod > You can find out OpenAI example clients under the > [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder. -### BentoML -OpenLLM LLM can be integrated as a -[Runner](https://docs.bentoml.com/en/latest/concepts/runner.html) in your -BentoML service. Simply call `await llm.generate` to generate text. Note that -`llm.generate` uses `runner` under the hood: - -```python -import bentoml -import openllm - -llm = openllm.LLM('microsoft/phi-2') - -svc = bentoml.Service(name='llm-phi-service', runners=[llm.runner]) - - -@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text()) -async def prompt(input_text: str) -> str: - generation = await llm.generate(input_text) - return generation.outputs[0].text -``` - -### [LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/modules.html#openllm) +### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/) To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`: @@ -1244,26 +1157,9 @@ from llama_index.llms.openllm import OpenLLMAPI > [!NOTE] > All synchronous and asynchronous API from `llama_index.llms.LLM` are supported. -### [LangChain](https://python.langchain.com/docs/ecosystem/integrations/openllm) +### [LangChain](https://python.langchain.com/docs/integrations/llms/openllm/) -To quickly start a local LLM with `langchain`, simply do the following: - -```python -from langchain.llms import OpenLLM - -llm = OpenLLM(model_name='llama', model_id='meta-llama/Llama-2-7b-hf') - -llm('What is the difference between a duck and a goose? 
And why there are so many Goose in Canada?') -``` - -> [!IMPORTANT] -> By default, OpenLLM uses the `safetensors` format for saving models. -> If the model doesn't support safetensors, make sure to pass -> `serialisation="legacy"` to use the legacy PyTorch bin format. - -`langchain.llms.OpenLLM` can interact with a remote OpenLLM -server. Given an OpenLLM server deployed elsewhere, you can connect to -it by specifying its URL: +Spin up an OpenLLM server, and connect to it by specifying its URL: ```python from langchain.llms import OpenLLM llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http') llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?') ``` -To integrate a LangChain agent with BentoML, you can do the following: - -```python -llm = OpenLLM(model_id='google/flan-t5-large', embedded=False, serialisation='legacy') -tools = load_tools(['serpapi', 'llm-math'], llm=llm) -agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION) -svc = bentoml.Service('langchain-openllm', runners=[llm.runner]) - - -@svc.api(input=Text(), output=Text()) -def chat(input_text: str): - return agent.run(input_text) -``` - -> [!NOTE] -> You can find more examples under the -> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder. - ### Transformers Agents OpenLLM seamlessly integrates with @@ -1346,11 +1224,10 @@ There are several ways to deploy your LLMs: ### ☁️ BentoCloud -Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/bento-cloud/), the -serverless cloud for shipping and scaling AI applications. +Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/), the inference platform +for fast-moving AI teams. -1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/cloud) - for early access +1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/) 2. **Log into your BentoCloud account:**