From f9221879f0bb699292b94ba2d94bd6702cd67f2d Mon Sep 17 00:00:00 2001 From: Chaoyu Date: Wed, 24 Apr 2024 13:23:24 -0700 Subject: [PATCH] docs: Update README.md (#964) * Update README.md Signed-off-by: Chaoyu * Update README.md Co-authored-by: Sherlock Xu <65327072+Sherlock113@users.noreply.github.com> Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com> --------- Signed-off-by: Chaoyu Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com> Co-authored-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com> Co-authored-by: Sherlock Xu <65327072+Sherlock113@users.noreply.github.com> --- README.md | 161 +++++++----------------------------------------------- 1 file changed, 19 insertions(+), 142 deletions(-) diff --git a/README.md b/README.md index 62160ab0..dc3117f2 100644 --- a/README.md +++ b/README.md @@ -3,61 +3,32 @@
-

🦾 OpenLLM: Self-Hosting Large Language Models Made Easy

+

🦾 OpenLLM: Self-Hosting LLMs Made Easy

pypi_status test_pypi_status - - Twitter - - Discord ci pre-commit.ci status -
- - python_version - - Hatch - - code style - - Ruff - - types - mypy - - types - pyright -
-

Run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints, locally and in the cloud.

- +
+ Twitter + + Discord +
## 📖 Introduction -OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy them on the cloud or on-premises, and build powerful AI applications. +OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment. -Key features include: -🚂 **State-of-the-art LLMs**: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. +- 🚂 Support for a wide range of open-source LLMs, including LLMs fine-tuned with your own data +- ⛓️ OpenAI-compatible API endpoints for a seamless transition from your LLM app to open-source LLMs +- 🔥 State-of-the-art serving and inference performance +- 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com) -🔥 **Flexible APIs**: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice. - -⛓️ **Freedom to build**: First-class support for LangChain, BentoML, LlamaIndex, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services. - -🎯 **Streamlined deployment**: Automatically generate your LLM server Docker images or deploy as serverless endpoints via -[☁️ BentoCloud](https://l.bentoml.com/bento-cloud), which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness. - -🤖 **Bring your own LLM**: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance on specific tasks. 
A unified fine-tuning API for models (`LLM.tuning()`) is coming soon. - -⚡ **Quantization**: Run inference with less computational and memory cost using quantization techniques such as [LLM.int8](https://arxiv.org/abs/2208.07339), [SpQR (int4)](https://arxiv.org/abs/2306.03078), [AWQ](https://arxiv.org/pdf/2306.00978.pdf), [GPTQ](https://arxiv.org/abs/2210.17323), and [SqueezeLLM](https://arxiv.org/pdf/2306.07629v2.pdf). - -📡 **Streaming**: Support token streaming through server-sent events (SSE). You can use the `/v1/generate_stream` endpoint for streaming responses from LLMs. - -🔄 **Continuous batching**: Support continuous batching via [vLLM](https://github.com/vllm-project/vllm) for increased total throughput. - -OpenLLM is designed for AI application developers working to build production-ready applications based on LLMs. It delivers a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring these models, simplifying the end-to-end deployment workflow for LLMs. @@ -70,6 +41,7 @@ OpenLLM is designed for AI application developers working to build production-re ## 💾 TL/DR For starters, we provide two ways to quickly try out OpenLLM: + ### Jupyter Notebooks Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb). @@ -93,6 +65,7 @@ docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b ## 🏃 Get started The following provides instructions for how to get started with OpenLLM locally. + ### Prerequisites You have installed Python 3.8 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts. @@ -133,7 +106,6 @@ Commands: prune Remove all saved models, (and optionally bentos) built with OpenLLM locally. query Query a LLM interactively, from a terminal. 
start Start a LLMServer for any supported LLM. - start-grpc Start a gRPC LLMServer for any supported LLM. Extensions: build-base-container Base image builder for BentoLLM. @@ -1121,49 +1093,11 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx . > [!IMPORTANT] > Fine-tuning support is still experimental and currently only works with the PyTorch backend. vLLM support is coming soon. -## 🐍 Python SDK - -Each LLM can be instantiated with `openllm.LLM`: - -```python -import openllm - -llm = openllm.LLM('microsoft/phi-2') -``` - -The main inference API is the streaming `generate_iterator` method: - -```python -async for generation in llm.generate_iterator('What is the meaning of life?'): - print(generation.outputs[0].text) -``` - -> [!NOTE] -> The motivation behind making `llm.generate_iterator` an async generator is to support continuous batching with the vLLM backend. With async endpoints, each prompt -> is added correctly to the request queue for processing by the vLLM backend. - -There is also a _one-shot_ `generate` method: - -```python -await llm.generate('What is the meaning of life?') -``` - -This method is convenient for one-shot generation, but it is merely an example of how to use `llm.generate_iterator`, as it uses `generate_iterator` under the hood. - -> [!IMPORTANT] -> If you need to call your code in a synchronous context, you can use `asyncio.run` to wrap an async function: -> -> ```python -> import asyncio -> async def generate(prompt, **attrs): return await llm.generate(prompt, **attrs) -> asyncio.run(generate("The meaning of life is", temperature=0.23)) -> ``` ## ⚙️ Integrations OpenLLM is not just a standalone product; it's a building block designed to integrate with other powerful tools easily. 
We currently offer integration with -[BentoML](https://github.com/bentoml/BentoML), [OpenAI's Compatible Endpoints](https://platform.openai.com/docs/api-reference/completions/object), [LlamaIndex](https://www.llamaindex.ai/), [LangChain](https://github.com/hwchase17/langchain), and @@ -1192,29 +1126,8 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod > You can find out OpenAI example clients under the > [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder. -### BentoML -OpenLLM LLM can be integrated as a -[Runner](https://docs.bentoml.com/en/latest/concepts/runner.html) in your -BentoML service. Simply call `await llm.generate` to generate text. Note that -`llm.generate` uses `runner` under the hood: - -```python -import bentoml -import openllm - -llm = openllm.LLM('microsoft/phi-2') - -svc = bentoml.Service(name='llm-phi-service', runners=[llm.runner]) - - -@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text()) -async def prompt(input_text: str) -> str: - generation = await llm.generate(input_text) - return generation.outputs[0].text -``` - -### [LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/modules.html#openllm) +### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/) To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`: @@ -1244,26 +1157,9 @@ from llama_index.llms.openllm import OpenLLMAPI > [!NOTE] > All synchronous and asynchronous API from `llama_index.llms.LLM` are supported. -### [LangChain](https://python.langchain.com/docs/ecosystem/integrations/openllm) +### [LangChain](https://python.langchain.com/docs/integrations/llms/openllm/) -To quickly start a local LLM with `langchain`, simply do the following: - -```python -from langchain.llms import OpenLLM - -llm = OpenLLM(model_name='llama', model_id='meta-llama/Llama-2-7b-hf') - -llm('What is the difference between a duck and a goose? 
And why there are so many Goose in Canada?') -``` - -> [!IMPORTANT] -> By default, OpenLLM uses the `safetensors` format for saving models. -> If the model doesn't support safetensors, make sure to pass -> `serialisation="legacy"` to use the legacy PyTorch bin format. - -`langchain.llms.OpenLLM` can interact with a remote OpenLLM -server. Given an OpenLLM server deployed elsewhere, you can connect to -it by specifying its URL: +Spin up an OpenLLM server, and connect to it by specifying its URL: ```python from langchain.llms import OpenLLM llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http') llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?') ``` -To integrate a LangChain agent with BentoML, you can do the following: - -```python -llm = OpenLLM(model_id='google/flan-t5-large', embedded=False, serialisation='legacy') -tools = load_tools(['serpapi', 'llm-math'], llm=llm) -agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION) -svc = bentoml.Service('langchain-openllm', runners=[llm.runner]) - - -@svc.api(input=Text(), output=Text()) -def chat(input_text: str): - return agent.run(input_text) -``` - -> [!NOTE] -> You can find more examples under the -> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder. - ### Transformers Agents OpenLLM seamlessly integrates with @@ -1346,11 +1224,10 @@ There are several ways to deploy your LLMs: ### ☁️ BentoCloud -Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/bento-cloud/), the -serverless cloud for shipping and scaling AI applications. +Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/), the inference platform +for fast-moving AI teams. -1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/cloud) - for early access +1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/) 2. **Log into your BentoCloud account:**