SGLang: Fast and Expressive LLM Inference with RadixAttention for 5x throughput

This README shows how to run a demo of SGLang, an open-source library for fast and expressive LLM inference and serving, for which its authors report up to 5x higher throughput.

Repo
Blog

Prerequisites

Install the latest SkyPilot and verify that your cloud credentials are set up:

pip install "skypilot-nightly[all]"
sky check

Serving vision-language model LLaVA with SGLang for more traffic using SkyServe

Create a SkyServe Service YAML with a service section:

service:
  # Path of the endpoint used to check the readiness of the service.
  readiness_probe: /health
  # How many replicas to manage.
  replicas: 2

The entire Service YAML can be found here: llava.yaml.
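
To make the role of readiness_probe concrete, here is a minimal client-side sketch of the same idea: poll a path until it answers with HTTP 200. SkyServe does its own probing of each replica; this is not its implementation. The function names, retry counts, and the assumption that "ready" means a 200 response over plain HTTP are all illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET on `url`, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 503).
        return err.code
    except OSError:
        # Covers URLError, connection refused, timeouts, DNS failures.
        return 0


def wait_until_ready(
    base_url: str,
    probe_path: str = "/health",  # same path as readiness_probe above
    attempts: int = 30,           # hypothetical default
    interval_s: float = 10.0,     # hypothetical default
    get_status: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll base_url + probe_path until it returns 200 or attempts run out."""
    url = base_url.rstrip("/") + probe_path
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

Once the service is up, something like wait_until_ready("http://" + endpoint) could be used to block a script until the endpoint answers, where endpoint is the host:port value reported by sky serve status.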

Start serving with the SkyServe CLI:

sky serve up -n sglang-llava llava.yaml

Use sky serve status to check the status of the service:

sky serve status sglang-llava

You should see output similar to the following:

Services
NAME          VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
sglang-llava  1        8m 16s  READY   2/2       34.32.43.41:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP              LAUNCHED     RESOURCES          STATUS  REGION
sglang-llava  1   1        34.85.154.76    16 mins ago  1x GCP({'L4': 1})  READY   us-east4
sglang-llava  2   1        34.145.195.253  16 mins ago  1x GCP({'L4': 1})  READY   us-east4

Store the endpoint of the service in a shell variable:

ENDPOINT=$(sky serve status --endpoint sglang-llava)

Once its status is READY, you can use the endpoint to talk to the model with both text and image inputs:

curl -L $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "liuhaotian/llava-v1.6-vicuna-7b",
    "messages": [
      {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/quick_start/images/cat.jpeg"
                }
            }
        ]
      }
    ]
  }'
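
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names are hypothetical, and it assumes the endpoint is a host:port string served over plain HTTP, as in the sky serve status output.

```python
import json
import urllib.request


def build_vision_request(model: str, prompt: str, image_url: str) -> dict:
    """Build a chat-completions payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat_completion(endpoint: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to /v1/chat/completions and return the decoded JSON.

    Assumes `endpoint` looks like "host:port" and is reachable over HTTP.
    """
    request = urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


payload = build_vision_request(
    model="liuhaotian/llava-v1.6-vicuna-7b",
    prompt="Describe this image",
    image_url="https://raw.githubusercontent.com/sgl-project/sglang/main/examples/quick_start/images/cat.jpeg",
)
# response = post_chat_completion("<host:port from sky serve status>", payload)
```

The network call is left commented out so the snippet can be run without a live service.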

You should get a response similar to the following:

{
  "id": "b044d5f637694d3bba30a2d784441c6c",
  "object": "chat.completion",
  "created": 1707565348,
  "model": "liuhaotian/llava-v1.6-vicuna-7b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": " This is an image of a cute, anthropomorphized cat character."
    },
    "finish_reason": null
  }],
  "usage": {
    "prompt_tokens": 2188,
    "total_tokens": 2204,
    "completion_tokens": 16
  }
}
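
A client usually needs only the reply text and the token counts from such a response. The following sketch reads both out of a response shaped like the one above; the function names are hypothetical, and the sample dictionary is an abridged copy of that response.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text of the first choice, without surrounding spaces."""
    return response["choices"][0]["message"]["content"].strip()


def usage_is_consistent(response: dict) -> bool:
    """Check that prompt and completion tokens add up to the reported total."""
    usage = response["usage"]
    return usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]


# Abridged copy of the sample response shown above.
sample = {
    "object": "chat.completion",
    "model": "liuhaotian/llava-v1.6-vicuna-7b",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": " This is an image of a cute, anthropomorphized cat character.",
        },
        "finish_reason": None,
    }],
    "usage": {"prompt_tokens": 2188, "total_tokens": 2204, "completion_tokens": 16},
}

print(extract_reply(sample))
print(usage_is_consistent(sample))  # 2188 + 16 == 2204
```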

Serving Llama-2 with SGLang for more traffic using SkyServe

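The process is the same as serving LLaVA, but with the model path changed to Llama-2. Because Llama-2 is a gated model on Hugging Face, pass your access token through the HF_TOKEN environment variable. Below are example commands for reference.

Start serving with the SkyServe CLI: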

sky serve up -n sglang-llama2 llama2.yaml --env HF_TOKEN=<your-huggingface-token>

The entire Service YAML can be found here: llama2.yaml.

Use sky serve status to check the status of the service:

sky serve status sglang-llama2

You should see output similar to the following:

Services
NAME           VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
sglang-llama2  1        8m 16s  READY   2/2       34.32.43.41:30001

Service Replicas
SERVICE_NAME   ID  VERSION  IP              LAUNCHED     RESOURCES          STATUS  REGION
sglang-llama2  1   1        34.85.154.76    16 mins ago  1x GCP({'L4': 1})  READY   us-east4
sglang-llama2  2   1        34.145.195.253  16 mins ago  1x GCP({'L4': 1})  READY   us-east4

Store the endpoint of the service in a shell variable:

ENDPOINT=$(sky serve status --endpoint sglang-llama2)

Once its status is READY, you can use the endpoint to interact with the model:

curl -L $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'

You should get a response similar to the following:

{
  "id": "cmpl-879a58992d704caf80771b4651ff8cb6",
  "object": "chat.completion",
  "created": 1692650569,
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": " Hello! I'm just an AI assistant, here to help you"
    },
    "finish_reason": "length"
  }],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 47,
    "completion_tokens": 16
  }
}
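
In the response above, finish_reason is "length" and the reply stops mid-sentence after 16 completion tokens, which indicates that generation hit a token limit. The sketch below shows one way a client might detect this and ask for a longer reply. It assumes the server honors the OpenAI-style max_tokens request field, which this README does not confirm; the helper names are hypothetical.

```python
from typing import Optional


def build_chat_request(
    model: str,
    user_prompt: str,
    system_prompt: str = "You are a helpful assistant.",
    max_tokens: Optional[int] = None,
) -> dict:
    """Build a text-only chat payload, optionally with a completion-length cap."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    if max_tokens is not None:
        # Assumption: the server accepts the OpenAI-style "max_tokens" field.
        payload["max_tokens"] = max_tokens
    return payload


def was_truncated(response: dict) -> bool:
    """Return True if any choice stopped because it ran out of tokens."""
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )


request = build_chat_request(
    "meta-llama/Llama-2-7b-chat-hf", "Who are you?", max_tokens=128
)
```

If was_truncated returns True for a response, resending the request with a larger max_tokens value is one option, within whatever limits the server enforces.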
