diff --git a/.python-version-default b/.python-version-default
index c8cfe395..bd28b9c5 100644
--- a/.python-version-default
+++ b/.python-version-default
@@ -1 +1 @@
-3.10
+3.9
diff --git a/openllm-python/README.md b/openllm-python/README.md
index 35dcc008..dc3117f2 100644
--- a/openllm-python/README.md
+++ b/openllm-python/README.md
@@ -3,62 +3,32 @@
 ## 📖 Introduction
 
-OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy them on the cloud or on-premises, and build powerful AI applications.
+OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment.
 
-Key features include:
-🚂 **State-of-the-art LLMs**: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder.
+- 🚂 Support a wide range of open-source LLMs including LLMs fine-tuned with your own data
+- ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs
+- 🔥 State-of-the-art serving and inference performance
+- 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com)
 
-🔥 **Flexible APIs**: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice.
-
-⛓️ **Freedom to build**: First-class support for LangChain, BentoML, LlamaIndex, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.
-
-🎯 **Streamline deployment**: Automatically generate your LLM server Docker images or deploy as serverless endpoints via
-[☁️ BentoCloud](https://l.bentoml.com/bento-cloud), which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness.
-
-🤖️ **Bring your own LLM**: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance for specific tasks. A unified fine-tuning API for models (`LLM.tuning()`) is coming soon.
-
-⚡ **Quantization**: Run inference with less computational and memory costs with quantization techniques such as [LLM.int8](https://arxiv.org/abs/2208.07339), [SpQR (int4)](https://arxiv.org/abs/2306.03078), [AWQ](https://arxiv.org/pdf/2306.00978.pdf), [GPTQ](https://arxiv.org/abs/2210.17323), and [SqueezeLLM](https://arxiv.org/pdf/2306.07629v2.pdf).
-
-📡 **Streaming**: Support token streaming through server-sent events (SSE). You can use the `/v1/generate_stream` endpoint for streaming responses from LLMs.
-
-**Continuous batching**: Support continuous batching via [vLLM](https://github.com/vllm-project/vllm) for increased total throughput.
-
-OpenLLM is designed for AI application developers working to build production-ready applications based on LLMs. It delivers a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring these models, simplifying the end-to-end deployment workflow for LLMs.
@@ -71,6 +41,7 @@ OpenLLM is designed for AI application developers working to build production-re
 ## 💾 TL/DR
 
 For starter, we provide two ways to quickly try out OpenLLM:
+
 ### Jupyter Notebooks
 
 Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb).
@@ -94,6 +65,7 @@ docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b
 ## 🏃 Get started
 
 The following provides instructions for how to get started with OpenLLM locally.
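Since the rewritten introduction above centers on the OpenAI-compatible endpoints (`/completions`, `/chat/completions`, and `/models`), a quick sketch of how a client talks to them may help. This is a hedged illustration, not part of the original README: it assumes a server is already running on `http://localhost:3000` (as in the Docker quick-start above), and the model id is a placeholder that should be taken from whatever `/v1/models` reports.

```python
# Hedged sketch: assumes `openllm start ...` (or the Docker command above) is already
# serving on http://localhost:3000 and that the official `openai` client (>= 1.0) is installed.
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # a local server ignores the key

# Discover what the server is actually serving, then send a chat completion request.
model_id = client.models.list().data[0].id
completion = client.chat.completions.create(
  model=model_id,
  messages=[{'role': 'user', 'content': 'What is OpenLLM?'}],
  max_tokens=128,
)
print(completion.choices[0].message.content)
```

The same server should work with any other OpenAI-compatible client or SDK, which is the point of the compatibility layer.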
+
 ### Prerequisites
 
 You have installed Python 3.8 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.
@@ -134,7 +106,6 @@ Commands:
   prune                  Remove all saved models, (and optionally bentos) built with OpenLLM locally.
   query                  Query a LLM interactively, from a terminal.
   start                  Start a LLMServer for any supported LLM.
-  start-grpc             Start a gRPC LLMServer for any supported LLM.
 
 Extensions:
   build-base-container   Base image builder for BentoLLM.
@@ -1122,49 +1093,11 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .
 
 > [!IMPORTANT]
 > Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.
 
-## 🐍 Python SDK
-
-Each LLM can be instantiated with `openllm.LLM`:
-
-```python
-import openllm
-
-llm = openllm.LLM('microsoft/phi-2')
-```
-
-The main inference API is the streaming `generate_iterator` method:
-
-```python
-async for generation in llm.generate_iterator('What is the meaning of life?'):
-  print(generation.outputs[0].text)
-```
-
-> [!NOTE]
-> The motivation behind making `llm.generate_iterator` an async generator is to provide support for Continuous batching with vLLM backend. By having the async endpoints, each prompt
-> will be added correctly to the request queue to process with vLLM backend.
-
-There is also a _one-shot_ `generate` method:
-
-```python
-await llm.generate('What is the meaning of life?')
-```
-
-This method is easy to use for one-shot generation use case, but merely served as an example how to use `llm.generate_iterator` as it uses `generate_iterator` under the hood.
-
-> [!IMPORTANT]
-> If you need to call your code in a synchronous context, you can use `asyncio.run` that wraps an async function:
->
-> ```python
-> import asyncio
-> async def generate(prompt, **attrs): return await llm.generate(prompt, **attrs)
-> asyncio.run(generate("The meaning of life is", temperature=0.23))
-> ```
 
 ## ⚙️ Integrations
 
 OpenLLM is not just a standalone product; it's a building block designed to integrate with other powerful tools easily. We currently offer integration with
-[BentoML](https://github.com/bentoml/BentoML),
 [OpenAI's Compatible Endpoints](https://platform.openai.com/docs/api-reference/completions/object),
 [LlamaIndex](https://www.llamaindex.ai/),
 [LangChain](https://github.com/hwchase17/langchain), and
@@ -1193,29 +1126,8 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod
 
 > You can find out OpenAI example clients under the
 > [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.
 
-### BentoML
-OpenLLM LLM can be integrated as a
-[Runner](https://docs.bentoml.com/en/latest/concepts/runner.html) in your
-BentoML service. Simply call `await llm.generate` to generate text. Note that
-`llm.generate` uses `runner` under the hood:
-
-```python
-import bentoml
-import openllm
-
-llm = openllm.LLM('microsoft/phi-2')
-
-svc = bentoml.Service(name='llm-phi-service', runners=[llm.runner])
-
-
-@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
-async def prompt(input_text: str) -> str:
-  generation = await llm.generate(input_text)
-  return generation.outputs[0].text
-```
-
-### [LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/modules.html#openllm)
+### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/)
 
 To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`:
@@ -1245,26 +1157,9 @@ from llama_index.llms.openllm import OpenLLMAPI
 
 > [!NOTE]
 > All synchronous and asynchronous API from `llama_index.llms.LLM` are supported.
 
-### [LangChain](https://python.langchain.com/docs/ecosystem/integrations/openllm)
+### [LangChain](https://python.langchain.com/docs/integrations/llms/openllm/)
 
-To quickly start a local LLM with `langchain`, simply do the following:
-
-```python
-from langchain.llms import OpenLLM
-
-llm = OpenLLM(model_name='llama', model_id='meta-llama/Llama-2-7b-hf')
-
-llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
-```
-
-> [!IMPORTANT]
-> By default, OpenLLM use `safetensors` format for saving models.
-> If the model doesn't support safetensors, make sure to pass
-> `serialisation="legacy"` to use the legacy PyTorch bin format.
-
-`langchain.llms.OpenLLM` has the capability to interact with remote OpenLLM
-Server. Given there is an OpenLLM server deployed elsewhere, you can connect to
-it by specifying its URL:
+Spin up an OpenLLM server, and connect to it by specifying its URL:
 
 ```python
 from langchain.llms import OpenLLM
@@ -1273,24 +1168,6 @@ llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http')
 llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
 ```
 
-To integrate a LangChain agent with BentoML, you can do the following:
-
-```python
-llm = OpenLLM(model_id='google/flan-t5-large', embedded=False, serialisation='legacy')
-tools = load_tools(['serpapi', 'llm-math'], llm=llm)
-agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
-svc = bentoml.Service('langchain-openllm', runners=[llm.runner])
-
-
-@svc.api(input=Text(), output=Text())
-def chat(input_text: str):
-  return agent.run(input_text)
-```
-
-> [!NOTE]
-> You can find out more examples under the
-> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.
-
 ### Transformers Agents
 
 OpenLLM seamlessly integrates with
@@ -1347,11 +1224,10 @@ There are several ways to deploy your LLMs:
 
 ### ☁️ BentoCloud
 
-Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/bento-cloud/), the
-serverless cloud for shipping and scaling AI applications.
+Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/), the inference platform
+for fast moving AI teams.
 
-1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/cloud)
-   for early access
+1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/)
 2. **Log into your BentoCloud account:**
diff --git a/openllm-python/pyproject.toml b/openllm-python/pyproject.toml
index aa61c289..95761dd0 100644
--- a/openllm-python/pyproject.toml
+++ b/openllm-python/pyproject.toml
@@ -55,7 +55,7 @@ dependencies = [
   "cuda-python;platform_system!=\"Darwin\"",
   "bitsandbytes<0.42",
 ]
-description = "OpenLLM: Operating LLMs in production"
+description = "OpenLLM: Run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints in the cloud."
 dynamic = ["version", "readme"]
 keywords = [
   "MLOps",
diff --git a/openllm-python/src/_openllm_tiny/_llm.py b/openllm-python/src/_openllm_tiny/_llm.py
index ed4f004c..541c82d7 100644
--- a/openllm-python/src/_openllm_tiny/_llm.py
+++ b/openllm-python/src/_openllm_tiny/_llm.py
@@ -213,12 +213,13 @@ class LLM:
         token_ids[output.index].extend(output.token_ids)
     if (final_result := result) is None:
       raise RuntimeError('No result is returned.')
-    return GenerationOutput.from_vllm(final_result).model_copy(
+    converted = GenerationOutput.from_vllm(final_result)
+    return converted.model_copy(
       update=dict(
         prompt=prompt,
         outputs=[
          output.model_copy(update=dict(text=''.join(texts[output.index]), token_ids=token_ids[output.index]))
-          for output in final_result.outputs
+          for output in converted.outputs
         ],
       )
     )
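Regarding the `_llm.py` hunk above: `model_copy(update=...)` is a Pydantic v2 method, so it exists on the converted `GenerationOutput` objects but presumably not on the raw vLLM outputs still reachable through `final_result`; iterating `converted.outputs` keeps the comprehension on Pydantic objects. The following is a minimal, self-contained sketch of that pattern using simplified, hypothetical stand-ins rather than the real OpenLLM and vLLM classes.

```python
# Hypothetical, simplified stand-ins -- not the real OpenLLM/vLLM classes.
from dataclasses import dataclass, field
from typing import Dict, List

from pydantic import BaseModel


@dataclass
class VllmCompletionOutput:
  # Mimics vLLM's per-sequence output: a plain object with no .model_copy().
  index: int
  text: str
  token_ids: List[int] = field(default_factory=list)


@dataclass
class VllmRequestOutput:
  # Mimics the final vLLM result available once streaming finishes.
  prompt: str
  outputs: List[VllmCompletionOutput] = field(default_factory=list)


class CompletionChunk(BaseModel):
  # Pydantic model, so .model_copy(update=...) is available.
  index: int
  text: str
  token_ids: List[int]


class GenerationOutput(BaseModel):
  prompt: str
  outputs: List[CompletionChunk]

  @classmethod
  def from_vllm(cls, result: VllmRequestOutput) -> 'GenerationOutput':
    return cls(
      prompt=result.prompt,
      outputs=[CompletionChunk(index=o.index, text=o.text, token_ids=list(o.token_ids)) for o in result.outputs],
    )


# Text/token deltas accumulated during the streaming loop, keyed by output index.
texts: Dict[int, List[str]] = {0: ['Hello', ', world']}
token_ids: Dict[int, List[int]] = {0: [15496, 11, 995]}
final_result = VllmRequestOutput(prompt='Say hi', outputs=[VllmCompletionOutput(index=0, text=', world')])

converted = GenerationOutput.from_vllm(final_result)
final = converted.model_copy(
  update=dict(
    prompt='Say hi',
    # Iterate the *converted* Pydantic outputs; the raw vLLM outputs have no .model_copy().
    outputs=[
      o.model_copy(update=dict(text=''.join(texts[o.index]), token_ids=token_ids[o.index]))
      for o in converted.outputs
    ],
  )
)
print(final.outputs[0].text)  # -> 'Hello, world'
```

With these stand-ins, swapping `converted.outputs` back to `final_result.outputs` would hand plain dataclass instances to `model_copy` and raise an `AttributeError`, which mirrors what the change appears to guard against.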