{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "mBcdKd8Z7WYH"
|
|
},
|
|
"source": [
|
|
"<center>\n",
|
|
" <p style=\"text-align:center\">\n",
|
|
" <h1>OpenLLM</h1>\n",
|
|
" <img alt=\"BentoML logo\" src=\"https://raw.githubusercontent.com/bentoml/BentoML/main/docs/source/_static/img/bentoml-logo-black.png\" width=\"200\"/>\n",
|
|
" </br>\n",
|
|
" <a href=\"https://github.com/bentoml/OpenLLM\">GitHub</a>\n",
|
|
" |\n",
|
|
" <a href=\"https://l.bentoml.com/join-openllm-discord\">Community</a>\n",
|
|
" </p>\n",
|
|
"</center>\n",
|
|
"<h1 align=\"center\">Serving Llama2 with OpenLLM</h1>\n",
|
|
"\n",
|
|
"[OpenLLM](https://github.com/bentoml/OpenLLM) is an open-source framework for serving and operating any Large Language Models (LLMs) in production."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "fqZysSucYlJv"
|
|
},
|
|
"source": [
|
|
"This is a project demonstrating basic usage of OpenLLM with\n",
|
|
"Llama 2 as an example. In this tutorial, you will learn the following:\n",
|
|
"\n",
|
|
"- Set up your environment to work with OpenLLM.\n",
|
|
"- Use OpenLLM Python APIs to create a demo.\n",
|
|
"- Serve LLMs like Llama 2 with just a single command.\n",
|
|
"- Explore different ways to interact with the OpenLLM server.\n",
|
|
"- Build bentos for production deployment."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "ZQ777iu3J69B"
|
|
},
|
|
"source": [
|
|
"## Set up the environment"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "w22BwSTCGa44"
|
|
},
|
|
"source": [
|
|
"You can try this demo in one of the following ways:\n",
|
|
"\n",
|
|
"1. Via Google Colab.\n",
|
|
"\n",
|
|
" We recommend you run this demo on a GPU. To verify if you're using a GPU on Google Colab, check the runtime type in the top left corner.\n",
|
|
"\n",
|
|
" To change the runtime type: In the toolbar menu, click **Runtime** > **Change runtime type** > Select the GPU (T4)\n",
|
|
" \n",
|
|
" Paid users may have access to more advanced GPUs (e.g, A100). For free users, the T4 GPU might occasionally be unavailable.\n",
|
|
"\n",
|
|
"2. (Optional) Run this project locally.\n",
|
|
"\n",
|
|
" If you have a GPU, you can also run this notebook locally with:\n",
|
|
" \n",
|
|
" ```\n",
|
|
" git clone git@github.com:bentoml/OpenLLM.git && cd OpenLLM/examples/openllm-llama2-demo && jupyter notebook\n",
|
|
" ```"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "DZO7EF_s5kbu"
|
|
},
|
|
"source": [
|
|
"### [Optional] Check GPU and memory resources"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "jnDCYgl0tm0X"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"##@title Check the memory and GPU info you have\n",
|
|
"import psutil\n",
|
|
"import torch\n",
|
|
"\n",
|
|
"\n",
|
|
"ram = psutil.virtual_memory()\n",
|
|
"ram_total = ram.total / (1024**3)\n",
|
|
"print('MemTotal: %.2f GB' % ram_total)\n",
|
|
"\n",
|
|
"print('=============GPU INFO=============')\n",
|
|
"if torch.cuda.is_available():\n",
|
|
" !/opt/bin/nvidia-smi || ture\n",
|
|
"else:\n",
|
|
" print('GPU NOT available')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "Q0ZTpnVOKYbj"
|
|
},
|
|
"source": [
|
|
"### Install required dependencies"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "ea6PG28SsdvA",
|
|
"outputId": "57f9b33c-b73b-4cc2-ae9d-6e0175f3cb60"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install -U -q openllm[llama,vllm] openai langchain\n",
|
|
"!apt install tensorrt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "RAVscZBnNozc"
|
|
},
|
|
"source": [
|
|
"## Python API demo"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "8aGgALP0L4Cs"
|
|
},
|
|
"source": [
|
|
"You can create a simple demo of serving LLMs quickly by using Openllm runner.\n",
|
|
"Learn more about Runners in the [BentoML documentation](https://docs.bentoml.com/en/latest/concepts/runner.html).\n",
|
|
"\n",
|
|
"Firstly, let's initialize an OpenLLM Runner locally.\n",
|
|
"- Here, we're serving the smallest Llama 2 model, the 7 billion parameter version. You can simply change 7b in the string with 13b or 70b for the larger parameter sizes. but a larger model means larger VRAM.\n",
|
|
"- we recommend you use vLLM as the backend for better performance.\n",
|
|
"- we set `torch_dtype` as `float16` manaually to avoid such in-compatible error that may have when loading `Bfloat16` on T4 or V100 GPUs. You do not have to do this if you are runing on more more advanced GPUs.\n",
|
|
" ```\n",
|
|
" ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5.\n",
|
|
" ```"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "z2kiJ3Pe8teq"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import openllm\n",
|
|
"\n",
|
|
"\n",
|
|
"llm = openllm.LLM('NousResearch/Nous-Hermes-llama-2-7b', backend='vllm', torch_dtype='float16', embedded=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "3LOHjoBOMj2L"
|
|
},
|
|
"source": [
|
|
"### Test it with an prompt "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "7lUEDZkC8s-N",
|
|
"outputId": "d310f106-8c9c-43b9-86c8-d03dba3725e4"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import asyncio\n",
|
|
"\n",
|
|
"import nest_asyncio\n",
|
|
"\n",
|
|
"\n",
|
|
"nest_asyncio.apply()\n",
|
|
"\n",
|
|
"\n",
|
|
"async def main():\n",
|
|
" async for gen in llm.generate_iterator('What is the weather like in San Francisco?', max_new_tokens=128):\n",
|
|
" print(gen.outputs[0].text, flush=True, end='')\n",
|
|
"\n",
|
|
"\n",
|
|
"asyncio.run(main())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "vu7Y8c4XOPiP"
|
|
},
|
|
"source": [
|
|
"### Clean up\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "ihFtzXWCOW8G",
|
|
"outputId": "083803d7-9cb0-4be7-d5f8-ab1cb4d764bc"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import gc\n",
|
|
"import torch\n",
|
|
"\n",
|
|
"del llm\n",
|
|
"\n",
|
|
"torch.cuda.empty_cache()\n",
|
|
"gc.collect()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "HTJIVFrU75zF"
|
|
},
|
|
"source": [
|
|
"## OpenLLM server demo"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "gI1E8Lu4Z2zi"
|
|
},
|
|
"source": [
|
|
"### Launch an OpenLLM server\n",
|
|
"\n",
|
|
"it is quite straightforward to Launch an OpenLLM server. With just a single command\n",
|
|
"```\n",
|
|
"openllm start {MODEL_ID} --backend [pt|vllm|...]\n",
|
|
"```\n",
|
|
"\n",
|
|
"OpenLLM supports a variety of LLMs and archtiectures. Learn more in https://github.com/bentoml/OpenLLM#-supported-models.\n",
|
|
"\n",
|
|
"To unblock the following steps, run it in the background via `nohup`:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"id": "1KxYxYCZ8s5D"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!nohup openllm start NousResearch/Nous-Hermes-llama-2-7b --port 8001 --dtype float16 --backend vllm > openllm.log 2>&1 &"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "6Cgu02pwbuOf"
|
|
},
|
|
"source": [
|
|
"### [IMPORTANT] Server status check"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "_57J8LBCwLza"
|
|
},
|
|
"source": [
|
|
"Before you interact with the OpenLLM server, it's crucial to ensure that it is up and running. The output of the `curl` command should start with `HTTP/1.1 200 OK`, meaning everything is in order.\n",
|
|
"\n",
|
|
"If it says `curl: (7) Failed to connect to localhost...`, then check `./openllm.log`; likely the server has failed to start or is still in the process of starting.\n",
|
|
"\n",
|
|
"If it says `HTTP/1.1 503 Service Unavailable`, the server is still starting and you should wait a bit and retry."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "rzVfeo9Tbytk",
|
|
"outputId": "4d75dd4b-a248-4bc3-f675-1465dd8ea07c"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"! curl -i http://127.0.0.1:8001/readyz"
|
|
]
|
|
},
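{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, poll the readiness endpoint from Python instead of re-running the `curl` cell by hand. This is a minimal sketch; it assumes the server was started on port 8001 as above and that the `requests` package is available in the runtime:\n",
"\n",
"```\n",
"import time\n",
"\n",
"import requests\n",
"\n",
"\n",
"def wait_until_ready(url='http://127.0.0.1:8001/readyz', timeout=600):\n",
"  # Poll the readiness endpoint until it returns HTTP 200 or the timeout expires\n",
"  deadline = time.time() + timeout\n",
"  while time.time() < deadline:\n",
"    try:\n",
"      if requests.get(url, timeout=5).status_code == 200:\n",
"        print('Server is ready')\n",
"        return\n",
"    except requests.exceptions.ConnectionError:\n",
"      pass  # the server process may still be starting; check ./openllm.log\n",
"    time.sleep(5)\n",
"  raise TimeoutError('Server did not become ready in time')\n",
"\n",
"\n",
"wait_until_ready()\n",
"```"
]
},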
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "jo2d1_g4-lAW"
|
|
},
|
|
"source": [
|
|
"### Interact with the LLM server\n",
|
|
"\n",
|
|
"Use one of the following ways to access the server.\n",
|
|
"\n",
|
|
"1. Run the `openllm query` command to query the model:\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "PvWPuh6Q-6Vq",
|
|
"outputId": "0ef77a28-f91c-4eac-fea8-3eb5724cb84b"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!openllm query --endpoint http://127.0.0.1:8001 --timeout 120 \"What is the weight of the earth?\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "aBvXcOL7a__J"
|
|
},
|
|
"source": [
|
|
"2. if you are in Google Colab, visit the web UI."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "VTvLhCDabFpe"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import sys\n",
|
|
"\n",
|
|
"\n",
|
|
"if 'google.colab' in sys.modules:\n",
|
|
" # using colab proxy URL\n",
|
|
" from google.colab.output import eval_js\n",
|
|
"\n",
|
|
" print('you are in colab runtime. please try it out in %s' % eval_js('google.colab.kernel.proxyPort(8001)'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "2j25weRLUFlb"
|
|
},
|
|
"source": [
|
|
"3. Use OpenLLM's built-in Python client."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "AQX6pOz8BEu3",
|
|
"outputId": "dea431a8-9bec-496d-a802-66895be41365"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import openllm\n",
|
|
"\n",
|
|
"\n",
|
|
"# sync API\n",
|
|
"client = openllm.HTTPClient('http://127.0.0.1:8001', timeout=120)\n",
|
|
"res = client.generate('What is the weight of the earth?', max_new_tokens=8192)\n",
|
|
"\n",
|
|
"# Async API\n",
|
|
"# async_client = openllm.AsyncHTTPClient(\"http://127.0.0.1:8001\", timeout=120)\n",
|
|
"# res = await async_client.generate(\"what is the weight of the earth?\", max_new_tokens=8192)\n",
|
|
"print(res.outputs[0].text)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "MNHjOpdpOpd-",
|
|
"outputId": "738dee42-6b9b-4b72-a650-e57a7dea9f87"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# streaming\n",
|
|
"for it in client.generate_stream('What is the weight of the earth?', max_new_tokens=2048, n=2, best_of=2):\n",
|
|
" #print(f'index {it.index}: {it.text} (token: {it.token_ids})')\n",
|
|
" print(it.text, end=\"\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "qgq0TOZqUm_Z"
|
|
},
|
|
"source": [
|
|
"4. Send a request using `curl`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "wQlwT1wc_M-r",
|
|
"outputId": "cd336122-9516-4415-b6cc-dc4f67c4f2c0"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!curl -k -X 'POST' -N \\\n",
|
|
" 'http://127.0.0.1:8001/v1/generate_stream' \\\n",
|
|
" -H 'accept: text/event-stream' \\\n",
|
|
" -H 'Content-Type: application/json' \\\n",
|
|
" -d '{\"prompt\":\"write a tagline for an ice cream shop\", \"llm_config\": {\"max_new_tokens\": 256}}'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "vFAYN1_o_bRS"
|
|
},
|
|
"source": [
|
|
"5. Use the OpenAI compatible endpoint."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "0G5clTYV_M8J",
|
|
"outputId": "bf792612-0c21-482c-8a80-4fbc5e71ebc4"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import openai\n",
|
|
"import os\n",
|
|
"\n",
|
|
"\n",
|
|
"client = openai.OpenAI(base_url=os.getenv('OPENLLM_ENDPOINT', 'http://localhost:8001') + '/v1', api_key='na')\n",
|
|
"models = client.models.list()\n",
|
|
"print('Models:', models.model_dump_json(indent=2))\n",
|
|
"model = models.data[0].id\n",
|
|
"\n",
|
|
"#os.environ[\"STREAM\"] = \"TRUE\"\n",
|
|
"stream = str(os.getenv('STREAM', False)).upper() in ['TRUE', '1', 'YES', 'Y', 'ON']\n",
|
|
"\n",
|
|
"completions = client.completions.create(\n",
|
|
" prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=stream\n",
|
|
")\n",
|
|
"\n",
|
|
"print(f'Completion result (stream={stream}):')\n",
|
|
"if stream:\n",
|
|
" for chunk in completions:\n",
|
|
" text = chunk.choices[0].text\n",
|
|
" if text:\n",
|
|
" print(text, flush=True, end='')\n",
|
|
"else:\n",
|
|
" print(completions.choices[0].text)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "T3JOFs9sBNyH"
|
|
},
|
|
"source": [
|
|
"#### LangChain integration\n",
|
|
"\n",
|
|
"OpenLLM supports integration with LangChain. You can use `langchain.llms.OpenLLM` to interact with the remote OpenLLM server. You can connect to it by specifying its URL:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {
|
|
"id": "GTg055FH_M5w"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.llms import OpenLLM\n",
|
|
"\n",
|
|
"\n",
|
|
"llm = OpenLLM(server_url='http://localhost:8001')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "Wyiz3fLoBeJ2"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.chains import LLMChain\n",
|
|
"from langchain.prompts import PromptTemplate\n",
|
|
"\n",
|
|
"\n",
|
|
"template = 'What is a good name for a company that makes {product}?'\n",
|
|
"\n",
|
|
"prompt = PromptTemplate(template=template, input_variables=['product'])\n",
|
|
"\n",
|
|
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
|
|
"\n",
|
|
"generated = llm_chain.run(product='mechanical keyboard')\n",
|
|
"print(generated)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "DNZFqf-gAdgF"
|
|
},
|
|
"source": [
|
|
"### Stop the server in the background"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 34,
|
|
"metadata": {
|
|
"id": "zZX8ICQmAdTi"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pkill -f 'bentoml|openllm'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "4tUAIkMuBzPs"
|
|
},
|
|
"source": [
|
|
"## Deploy Llama2 in production with BentoCloud\n",
|
|
"\n",
|
|
"After you test the server, you can deploy it in production using BentoCloud."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "otXwv1No1RXP"
|
|
},
|
|
"source": [
|
|
"### What is BentoCloud?\n",
|
|
"\n",
|
|
"[BentoCloud](https://www.bentoml.com/cloud) is a fully-managed platform designed for building and operating AI applications.\n",
|
|
"\n",
|
|
" * Easiest way to deploy and operate AI applications.\n",
|
|
" * Natively support the OpenLLM workflow and optimization.\n",
|
|
"\n",
|
|
"If you don't have a BentoCloud account, visit the [BentoCloud website](https://www.bentoml.com/cloud) to start a free trial.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "YRtSpeyQ2A0S"
|
|
},
|
|
"source": [
|
|
"You can follow the steps in these blog posts to deploy your LLM to BentoCloud:\n",
|
|
"\n",
|
|
"* [llama2-7b](https://www.bentoml.com/blog/deploying-llama-2-7b-on-bentocloud)\n",
|
|
"* [llama2-13b](https://www.bentoml.com/blog/openllm-in-action-part-2-deploying-llama-2-13b-on-bentocloud)\n",
|
|
"* [llama2-70b]()\n",
|
|
"\n",
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "n-3NqbdB3PPM"
|
|
},
|
|
"source": [
|
|
"### Build a Bento\n",
|
|
"\n",
|
|
"Use OpenLLM to build the model into a standardized distribution unit in BentoML, also known as a Bento. Command:\n",
|
|
"\n",
|
|
"```\n",
|
|
"openllm build {model-id} --backend [pt|vllm]\n",
|
|
"```"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "BKpvHxGNwhIK",
|
|
"outputId": "30de9160-8c05-4cba-f1be-8c1fb0a6faa0"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!openllm build NousResearch/Nous-Hermes-llama-2-7b --backend vllm --dtype float16"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "EDFu2jTG3n-Q"
|
|
},
|
|
"source": [
|
|
"### View the Bento"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"id": "3k9816Jd3rnY",
|
|
"outputId": "62ebf693-b293-4707-dd5d-53df15270938"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!openllm list-bentos"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "i2qwy9l53uso"
|
|
},
|
|
"source": [
|
|
"### Log in to BentoCloud and push the Bento\n",
|
|
"\n",
|
|
"To log in to BentoCloud and push the Bento to it, you need your BentoCloud endpoint URL and an API token. For more information, see [Manage access tokens](https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "MTMSjC_71Xk6"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"return_code = !bentoml cloud list-context\n",
|
|
"\n",
|
|
"if 'colab-user' not in ''.join(return_code):\n",
|
|
" # Log in to BentoCloud\n",
|
|
" endpoint = input('input endpoint (like https://xxx.cloud.bentoml.com): ')\n",
|
|
" token = input('input token (please follow https://docs.bentoml.com/en/latest/bentocloud/how-tos/manage-access-token.html#creating-an-api-token):')\n",
|
|
" !bentoml cloud login --api-token {token} --endpoint {endpoint} --context colab-user\n",
|
|
"\n",
|
|
"# Replace the Bento tag with your own\n",
|
|
"!bentoml push nousresearch--nous-hermes-llama-2-7b-service:b7c3ec54b754175e006ef75696a2ba3802697078 --context colab-user"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "7YwNbOF84Oer"
|
|
},
|
|
"source": [
|
|
"### Create a Deployment via the BentoCloud Console\n",
|
|
"\n",
|
|
"Follow this [guide](https://www.bentoml.com/blog/deploying-llama-2-7b-on-bentocloud) to deploy this Bento on BentoCloud.\n",
|
|
"\n",
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "QtzO813w9pR4"
|
|
},
|
|
"source": [
|
|
"### Create a Deployment via the BentoML client\n",
|
|
"\n",
|
|
"You can find detailed configuration in [Deployment creation and update information](https://docs.bentoml.com/en/latest/bentocloud/reference/deployment-creation-and-update-info.html).\n",
|
|
"\n",
|
|
"📢 Make sure you have logged in to BentoCloud in the last step."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "WLUH1c6rcE47"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"###@title Alternatively, use the BentoML client to create a Deployment.\n",
|
|
"import json\n",
|
|
"\n",
|
|
"import bentoml\n",
|
|
"\n",
|
|
"\n",
|
|
"return_code = !bentoml cloud list-context\n",
|
|
"if 'colab-user' not in ''.join(return_code):\n",
|
|
" print('please login first!')\n",
|
|
"else:\n",
|
|
" client = bentoml.cloud.BentoCloudClient()\n",
|
|
" # runner config\n",
|
|
" runner = bentoml.cloud.Resource.for_runner(\n",
|
|
" resource_instance='starter-aws-g4dn-xlarge-gpu-t4-xlarge'\n",
|
|
" # hpa_conf={\"min_replicas\": 1, \"max_replicas\": 1},\n",
|
|
" )\n",
|
|
" # api-server hpa config\n",
|
|
" api_server = bentoml.cloud.Resource.for_api_server(resource_instance='starter-aws-t3-2xlarge-cpu-small')\n",
|
|
" hpa_conf = bentoml.cloud.Resource.for_hpa_conf(min_replicas=1, max_replicas=1)\n",
|
|
"\n",
|
|
" res = client.deployment.create(\n",
|
|
" deployment_name='test-llama2',\n",
|
|
" bento='nousresearch--nous-hermes-llama-2-7b-service:b7c3ec54b754175e006ef75696a2ba3802697078',\n",
|
|
" context='colab-user',\n",
|
|
" cluster_name='default',\n",
|
|
" # mode=\"deployment\",\n",
|
|
" kube_namespace='yatai',\n",
|
|
" runners_config={'llm-llama-runner': runner},\n",
|
|
" api_server_config=api_server,\n",
|
|
" hpa_conf=hpa_conf,\n",
|
|
" )\n",
|
|
" print(json.dump(res, indent=4))"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"accelerator": "GPU",
|
|
"colab": {
|
|
"provenance": []
|
|
},
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.16"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0
|
|
}
|