
vLLM Server

Requirements and Install Guide

The requirements and install guide are the same as for the vLLM API.

Serving Model

hyperdex-vllm provides an HTTP server that implements the vLLM API. You can start the server with the command below.

$ NUM_LPU_DEVICES=1 python -m vllm.entrypoints.api_server --model facebook/opt-1.3b

... OMISSION ...

INFO:     Started server process [27157]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Descriptions of HyperDex-vLLM Serve Arguments

The arguments are the same as for the vLLM Engine.

Client

To call the server, run this example command in another terminal:

curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "Hello, my name is",
      "max_tokens": 30,
      "temperature": 0
    }'
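
For programmatic access, a minimal Python sketch using the requests library is shown below. It mirrors the curl example above; the response schema (a JSON body with a "text" list) follows vLLM's demo api_server and should be verified against the HyperDex-vLLM version you are running.

# Minimal sketch: call the /generate endpoint with the requests library.
# Assumes the server started above is listening on localhost:8000.
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Hello, my name is",
        "max_tokens": 30,
        "temperature": 0,
    },
)
response.raise_for_status()
# Assumption: the response body is {"text": [...]}, as in vLLM's demo
# api_server; check the schema against the server you are running.
print(response.json()["text"][0])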

OpenAI Client

Serving Model

Our vLLM also provides an HTTP server that implements OpenAI's Completions and Chat Completions APIs. You can start the server with the command below.

$ NUM_GPU_DEVICES=1 NUM_LPU_DEVICES=2 vllm serve facebook/opt-1.3b

... OMISSION ...

INFO:     Started server process [27157]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Client

To call the server, you can use the OpenAI Python client library or any other HTTP client, as in the curl example and the Python sketch below.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "facebook/opt-1.3b",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
      ]
    }'
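
Alternatively, the same request can be sent through the OpenAI Python client library. A sketch follows; the base_url and model name match the server started above, and the api_key value is a placeholder unless your server is configured to require one.

# Sketch: the chat request above, sent via the OpenAI Python client.
# base_url points at the local server; api_key is a placeholder unless
# the server is configured to check one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="facebook/opt-1.3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant",
         "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
)
print(completion.choices[0].message.content)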