vLLM Docker
We support running vLLM in container environments using Docker images hosted on AWS. Running vLLM on LPUs inside a container simplifies deployment and improves stability.
Please note that AWS CLI access is required. Contact our support team if you need help with access or with the image name.
Requirements and Install Guide
- OS: Ubuntu 22.04 LTS, Rocky 8.4
- Python: 3.10 ~ 3.12
- torch: 2.7.0+cpu (LPU-only environment) or 2.7.0+cu126 (LPU + GPU environment); see the version check below
- Xilinx Runtime Library
- HyperDex-Toolchain
- Docker (install docker)
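If you want to confirm that the installed torch build matches the requirement above, a quick version check is enough. This is a minimal sketch; the expected strings come from the requirements list, and your actual output depends on how torch was installed.
| # Prints the installed torch build, e.g. "2.7.0+cpu" or "2.7.0+cu126"
import torch

print(torch.__version__)
|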
Docker Image setup
Install AWS CLI
| $ sudo apt install -y awscli
$ sudo apt list --installed | grep -i awscli
|
AWS configuration
| # Configure AWS CLI credentials. Please contact our support team at the URL below if you need help obtaining credentials.
# https://hyperaccel.atlassian.net/servicedesk/customer/portals
$ aws configure
AWS Access Key ID [None]: <AWS Access Key ID>
AWS Secret Access Key [None]: <AWS Secret Access Key>
Default region name [None]: us-east-1
Default output format [None]: json
|
AWS ECR login
| $ aws ecr get-login-password --region us-east-1 | docker login --username <username> --password-stdin 637423205005.dkr.ecr.us-east-1.amazonaws.com
|
Pull Docker image
| $ docker pull 637423205005.dkr.ecr.us-east-1.amazonaws.com/hyperdex/vllm-fpga:latest
$ docker images
|
Our vLLM Docker image supports both the native vLLM API server mode and the OpenAI-compatible server mode.
You can choose either option depending on your application requirements.
vLLM API server
Run Docker
| # Some models may require a Hugging Face token
$ docker run --privileged --name vllm-docker \
-e HF_TOKEN=<huggingface_token> \
-e NUM_LPU_DEVICES=1 \
-v /shared/huggingface:/root/.cache/huggingface \
-v /tmp/hyperdex:/tmp/hyperdex \
-p 8000:8000 637423205005.dkr.ecr.us-east-1.amazonaws.com/hyperdex/vllm-fpga:latest \
python3 -m vllm.entrypoints.api_server \
--model meta-llama/Llama-3.2-1B-Instruct \
--port 8000
|
Client
To call the server, you can use curl or any other HTTP client.
| $ curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-H "User-Agent: Test Client" \
-d '{
"prompt": "Act like an experienced HR Manager. Develop a human resources strategy for retaining top talents in a competitive industry. Industry: Energy Workforce: 550 Style: Formal Tone: Convincing",
"n": 1,
"temperature": 0.8,
"max_tokens": 40,
"top_p": 0.95,
"top_k": 1,
"stream": false
}'
|
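The same request can also be sent from Python. Below is a minimal sketch using the third-party requests package; the endpoint and sampling parameters mirror the curl example above, and the raw JSON response is printed since the exact response schema may differ across vLLM versions.
| import requests

# Same endpoint and sampling parameters as the curl example above
payload = {
    "prompt": "Hello, my name is",
    "n": 1,
    "temperature": 0.8,
    "max_tokens": 40,
    "top_p": 0.95,
    "top_k": 1,
    "stream": False,
}

response = requests.post("http://localhost:8000/generate", json=payload)
response.raise_for_status()
print(response.json())
|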
OpenAI-compatible server
Run Docker
| # Some models may require a Hugging Face token
$ docker run --privileged --name vllm-fpga \
-e HF_TOKEN=<huggingface_token> \
-e NUM_LPU_DEVICES=1 \
-v /shared/huggingface:/root/.cache/huggingface \
-v /tmp/hyperdex:/tmp/hyperdex \
-p 8000:8000 \
637423205005.dkr.ecr.us-east-1.amazonaws.com/hyperdex/vllm-fpga:latest \
vllm serve meta-llama/Llama-3.2-1B-Instruct
|
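Once the container is running, you can optionally verify that the server is reachable by listing the served models. This is a minimal sketch using the requests package against the standard /v1/models endpoint.
| import requests

# The OpenAI-compatible server lists the served model under /v1/models
response = requests.get("http://localhost:8000/v1/models")
response.raise_for_status()
print(response.json())
|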
Client
To call the server, you can use the OpenAI Python client library or any other HTTP client; curl examples are shown below, followed by a Python example.
| $ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello, my name is",
"max_tokens": 100,
"temperature": 0
}'
|
| $ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "테스트"}],
"max_tokens": 50
}'
|
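The same endpoints can be called with the OpenAI Python client library by pointing it at the local server. This is a minimal sketch; the api_key value is a placeholder, assuming the server is not configured to require a real key.
| from openai import OpenAI

# Point the OpenAI client at the local OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Test"}],
    max_tokens=50,
)
print(completion.choices[0].message.content)
|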