
vLLM Docker

We support running vLLM in containerized environments using Docker images hosted on AWS. Running vLLM on LPUs inside a container makes it easier to deploy and more stable.

Please note that AWS CLI access is required. Contact our support team if you need help with access or with the image name.

Requirements and Installation Guide

  • OS: Ubuntu 22.04 LTS, Rocky 8.4
  • Python: 3.10 ~ 3.12
  • torch: 2.7.0+cpu (LPU-only environment) or 2.7.0+cu126 (LPU+GPU environment); a version-check sketch follows this list
  • Xilinx Runtime Library
  • HyperDex-Toolchain
  • Docker (see the official Docker installation guide)
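
To confirm that the Python and torch requirements above are met, a minimal check is sketched below. It only prints the installed versions; the expected values are taken from the list above.

# Minimal version check for the Python and torch requirements listed above.
import sys

import torch

print(sys.version)        # expect Python 3.10 ~ 3.12
print(torch.__version__)  # expect 2.7.0+cpu (LPU-only) or 2.7.0+cu126 (LPU+GPU)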

Docker Image Setup

Install AWS CLI

$ sudo apt install -y awscli
$ sudo apt list --installed | grep -i awscli

AWS Configuration

# Configure AWS CLI credentials. Please contact our support team at the URL below if you need help obtaining credentials.
# https://hyperaccel.atlassian.net/servicedesk/customer/portals
$ aws configure

AWS Access Key ID [None]: <AWS Access Key ID>
AWS Secret Access Key [None]: <AWS Secret Access Key>
Default region name [None]: us-east-1
Default output format [None]: json

AWS ECR Login

$ aws ecr get-login-password --region us-east-1 | docker login --username <username> --password-stdin 637423205005.dkr.ecr.us-east-1.amazonaws.com

Pull Docker Image

$ docker pull 637423205005.dkr.ecr.us-east-1.amazonaws.com/hyperdex/vllm-fpga:latest
$ docker images

Our vLLM Docker image supports both the vLLM API server mode and the OpenAI-compatible server mode. You can choose either option depending on your application requirements.

vLLM API server

Run Docker

# Some models may require a Hugging Face token
$ docker run --privileged --name vllm-docker \
        -e HF_TOKEN=<huggingface_token> \
        -e NUM_LPU_DEVICES=1 \
        -v /shared/huggingface:/root/.cache/huggingface \
        -v /tmp/hyperdex:/tmp/hyperdex \
        -p 8000:8000 637423205005.dkr.ecr.us-east-1.amazonaws.com/hyperdex/vllm-fpga:latest \
        python3 -m vllm.entrypoints.api_server \
        --model meta-llama/Llama-3.2-1B-Instruct \
        --port 8000

Client

To call the server, you can use any HTTP client, such as curl.

$ curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -H "User-Agent: Test Client" \
  -d '{
    "prompt": "Act like an experienced HR Manager. Develop a human resources strategy for retaining top talents in a competitive industry. Industry: Energy Workforce: 550 Style: Formal Tone: Convincing",
    "n": 1,
    "temperature": 0.8,
    "max_tokens": 40,
    "top_p": 0.95,
    "top_k": 1,
    "stream": false
  }'
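
The same request can be sent from Python. The sketch below uses the requests library (one option among many, not something the image requires) and assumes the API server started above is reachable at localhost:8000; the payload mirrors the curl example.

import requests

# Mirror the curl request to the vLLM API server's /generate endpoint.
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Act like an experienced HR Manager. Develop a human resources strategy for retaining top talents in a competitive industry.",
        "n": 1,
        "temperature": 0.8,
        "max_tokens": 40,
        "top_p": 0.95,
        "top_k": 1,
        "stream": False,
    },
)
print(response.json())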

OpenAI-Compatible Server

Run Docker

# Some models may require a Hugging Face token
$ docker run --privileged --name vllm-fpga \
        -e HF_TOKEN=<huggingface_token> \
        -e NUM_LPU_DEVICES=1  \
        -v /shared/huggingface:/root/.cache/huggingface  \
        -v /tmp/hyperdex:/tmp/hyperdex  \
        -p 8000:8000  \
        637423205005.dkr.ecr.us-east-1.amazonaws.com/hyperdex/vllm-fpga:latest \
        vllm serve meta-llama/Llama-3.2-1B-Instruct

Client

To call the server, you can use the OpenAI Python client library or any other HTTP client.

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "prompt": "Hello, my name is",
    "max_tokens": 100,
    "temperature": 0
  }'
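
If you prefer the OpenAI Python client library mentioned above, the sketch below sends the same completion request. The base_url and model name match the server started above; the api_key value is a placeholder and is assumed not to be validated by this server.

from openai import OpenAI

# Point the OpenAI client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="Hello, my name is",
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text)
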
$ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "테스트"}],
    "max_tokens": 50
  }'
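
The chat endpoint can be called the same way with the OpenAI client; a sketch under the same assumptions:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Chat-style request against /v1/chat/completions.
chat = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Test"}],
    max_tokens=50,
)
print(chat.choices[0].message.content)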