vLLM API
HyperDex supports the vLLM framework on the LPU. vLLM officially supports a variety of hardware, including GPU, TPU, and XPU. HyperDex maintains its own branch of vLLM with a backend designed specifically for the LPU, which makes it very easy to use. If your system already uses vLLM, you can switch from GPU to LPU without changing any code.
Requirements
- OS: Ubuntu 22.04 LTS, Rocky 8.4
- Python: 3.9 ~ 3.11
- torch: 2.4.0+cpu (LPU-only environment) or 2.4.0+cu121 (LPU+GPU environment)
- Xilinx Runtime Library
- HyperDex Runtime & Compiler stack
Install with pip
You can install `hyperdex-vllm` using pip, which requires access rights to HyperAccel's private PyPI server. To install the HyperDex Python package, run the following command:
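A minimal sketch of the install command, assuming pip is already configured with credentials for the private server; the index URL below is a placeholder, not the real address:

```bash
# Install the HyperDex vLLM package from HyperAccel's private PyPI server.
# Replace the placeholder index URL with the address provided by HyperAccel.
pip install hyperdex-vllm --extra-index-url https://<hyperaccel-private-pypi>/simple
```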
Text Generation with HyperAccel LPU™
HyperDex-vLLM generates tokens through the same generate function as vLLM, so you can produce text just as you would with stock vLLM, as demonstrated in the example below. Ensure that device="fpga" and num_lpu_devices=1 are set.
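A minimal sketch using the standard vLLM LLM/SamplingParams interface together with the LPU-specific options from this page; the model name and prompt are placeholders:

```python
from vllm import LLM, SamplingParams

# Load the model on a single LPU; device="fpga" selects the LPU backend.
llm = LLM(
    model="facebook/opt-1.3b",  # placeholder model name
    device="fpga",
    num_lpu_devices=1,
)

sampling_params = SamplingParams(max_tokens=128, top_p=0.7, top_k=1)

# generate() is called exactly as in stock vLLM.
outputs = llm.generate(["What is the capital of France?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```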
GPU-LPU Hybrid System
HyperDex supports a heterogeneous GPU-LPU hardware system for executing large language models (LLMs). Each hardware type offers distinct strengths: the GPU excels at large-scale parallel computation, while the LPU is designed to fully utilize memory bandwidth.
Since the prefill stage is compute-bound and the decode stage is memory-bound, the hybrid system runs the prefill stage on the GPU and the decode stage on the LPU. This approach significantly boosts LLM performance!
To enable the hybrid system, simply add the option num_gpu_devices=1.
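For example, extending the sketch above (assuming one GPU and one LPU are available; the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# Prefill runs on the GPU, decode runs on the LPU.
llm = LLM(
    model="facebook/opt-1.3b",  # placeholder model name
    device="fpga",
    num_lpu_devices=1,
    num_gpu_devices=1,          # enables the GPU-LPU hybrid system
)

outputs = llm.generate(["Explain the prefill and decode stages."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```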
Option for Sampling Params
Sampling parameters are settings that control how a model generates text. vLLM supports a wide range of sampling parameters, but due to current limitations of the LPU, only the parameters listed below are supported (an example follows the table). We plan to extend this coverage over time.
| Sampling Arguments | Description |
|---|---|
| max_tokens | Maximum number of tokens to generate. Default: 16 |
| top_p | Top-P sampling. Default: 0.7 |
| top_k | Top-K sampling. Default: 1 |
| temperature | Smooths the logit distribution. Default: 1.0 |
| repetition_penalty | Applies a penalty to the logits of repeated tokens. Default: 1.2 |
| stop | Token ID that signals the end of generation. Default: eos_token_id |
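As a rough sketch, the supported parameters map directly onto vLLM's SamplingParams; the values below are arbitrary:

```python
from vllm import SamplingParams

# Only the parameters listed in the table above are honored on the LPU.
sampling_params = SamplingParams(
    max_tokens=256,          # generate up to 256 tokens
    top_p=0.7,               # nucleus sampling threshold
    top_k=1,                 # greedy decoding when top_k=1
    temperature=1.0,         # no logit smoothing
    repetition_penalty=1.2,  # penalize repeated tokens
)
```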
Option for LLM Engine
The LLM Engine is the core component of vLLM: it initializes the hardware, manages hardware resources, and schedules incoming requests.
| LLM Engine Arguments | Description |
|---|---|
| model | Name or path of the huggingface model to use. Default: "facebook/opt-125m" |
| device | Device type for vLLM execution. Default: "cuda" |
| num_lpu_devices | Number of LPUs to compute in parallel. Default: 1 |
| num_gpu_devices | Number of GPUs to compute in parallel. Default: 0 |
| tokenizer | Name or path of the huggingface tokenizer to use. If not specified, the model name or path will be used |
| trust_remote_code | Trust remote code from huggingface. Default: False |
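Putting the engine arguments together, a hedged sketch of a multi-LPU configuration might look like this; the model name and device counts are illustrative:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder huggingface model
    device="fpga",                     # run on the LPU backend
    num_lpu_devices=2,                 # two LPUs in parallel (illustrative)
    num_gpu_devices=0,                 # LPU-only; set to 1 for the hybrid system
    trust_remote_code=False,
)
```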