Python API
HyperDex provides a Python API designed to make running workloads on the LPU both easy and efficient. The API uses function calls similar to those in HuggingFace's transformers library, so users already familiar with HuggingFace can integrate HyperDex into their existing workflows without a steep learning curve.
Requirements
- OS: Ubuntu 22.04 LTS, Rocky 8.4
- Python: 3.9 ~ 3.11
- Xilinx Runtime Library
- HyperDex Runtime & Compiler stack
Install with pip
You can install `hyperdex-transformers` using pip, which requires access rights to HyperAccel's private PyPI server. To install the HyperDex Python package, run the following command:
Text Generation with HyperAccel LPU™
HyperDex allows you to generate output tokens using a function similar to HuggingFace's `generate` function. Therefore, you can easily generate tokens as shown in the example below.
LPU-GPU Hybrid System
Starting from version 1.3.2, HyperDex-Python supports the LPU-GPU hybrid system. The GPU, which has relatively higher computing power, handles the Prefill stage of the Transformer, while the LPU, which uses memory bandwidth efficiently, handles the Decode stage. The Key-Value transfer between Prefill and Decode can be performed without overhead using HyperDex's proprietary technology. You can select the number of GPU and LPU devices to use through the `device_map` option.
Note
To run the LPU-GPU hybrid system, you need CUDA 12.1 or later installed on your system. Additionally, since the GPU uses PyTorch to run LLMs, it is recommended to install `torch` version 2.4.0 or later, built for your CUDA version, to ensure compatibility and performance.
Sampling
Sampling works in the same way as in HuggingFace. The available options are `top_p`, `top_k`, `temperature`, and `repetition_penalty`. Please refer to the HuggingFace documentation for explanations of each option. Additionally, the `generate` function allows you to directly control randomness using the `seed` argument. If `do_sample` is `False`, the LPU does not perform sampling and uses greedy decoding instead.
Sampling Arguments | Description |
---|---|
top_p | Top-P sampling. Default value is 0.7 |
top_k | Top-K sampling. Default value is 1 |
temperature | Smooths the logit distribution. Default value is 1.0 |
repetition_penalty | Applies a penalty to logits. Default value is 1.2 |
stop | Token ID that signals the end of generation. Default value is eos_token_id |
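To make the interaction between these options concrete, here is a minimal pure-Python sketch of one sampling step. It is not HyperDex's implementation: it assumes the HuggingFace-style order (repetition penalty, then temperature, top-k, and top-p), and the function name `sample_next_token` and its `generated_ids` parameter are hypothetical names chosen for illustration. The default values mirror the table above.

```python
import math
import random


def sample_next_token(logits, generated_ids=(), do_sample=True, top_p=0.7,
                      top_k=1, temperature=1.0, repetition_penalty=1.2,
                      seed=None):
    """Pick the next token ID from raw logits (illustrative sketch only)."""
    scores = list(logits)

    # Repetition penalty (HuggingFace-style): lower the score of every
    # token that has already been generated.
    for tok in set(generated_ids):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty

    # Greedy decoding: no sampling, just take the highest score.
    if not do_sample:
        return max(range(len(scores)), key=scores.__getitem__)

    # Temperature: values above 1 flatten the distribution, below 1 sharpen it.
    scores = [s / temperature for s in scores]

    # Top-K: keep only the k highest-scoring token IDs, best first.
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    order = order[:top_k]

    # Softmax over the surviving tokens (shifted by the max for stability).
    best = scores[order[0]]
    weights = [math.exp(scores[i] - best) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-P: keep the smallest prefix whose cumulative probability
    # reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for tok, p in zip(order, probs):
        kept_ids.append(tok)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # A fixed seed makes the random draw reproducible.
    rng = random.Random(seed)
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]
```

In this sketch, the default `top_k` of 1 leaves only the highest-scoring token after filtering, so the result matches greedy decoding; raising `top_k` (for example `top_k=50`) is what lets `top_p`, `temperature`, and `seed` influence the outcome.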
Streaming Token Generation
HyperDex supports streaming token generation in a similar manner to HuggingFace. You can activate it by passing the `TextStreamer` module as an argument to the `generate` function.
Since the `TextStreamer` module internally handles both decoding through the tokenizer and printing, you only need to call the `generate` function.
How to use streaming with other applications?
HyperDex uses Python's `yield` keyword to enable streaming for use in other applications. When you call the `generate_yield` function, it returns its results through `yield`, so any Python application can consume the output as it is produced.
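The consumption pattern can be sketched without LPU hardware. In the example below, `fake_generate_yield` is a hypothetical stand-in for a `generate_yield`-style call, and it assumes that one token ID is yielded at a time; the real function's arguments and yielded values may differ. `stream_text` and the toy `decode` function are also illustrative names, not part of HyperDex.

```python
from typing import Callable, Iterable, Iterator, List


def fake_generate_yield(token_ids: List[int]) -> Iterator[int]:
    """Hypothetical stand-in: yields token IDs one by one, like a streaming model."""
    for tok in token_ids:
        yield tok


def stream_text(token_stream: Iterable[int],
                decode: Callable[[List[int]], str]) -> Iterator[str]:
    """Decode each token as soon as it arrives and yield the text piece."""
    for tok in token_stream:
        yield decode([tok])


# Toy vocabulary and decoder standing in for a real tokenizer.
VOCAB = {0: "Hello", 1: " world", 2: "!"}


def decode(ids: List[int]) -> str:
    return "".join(VOCAB[i] for i in ids)


# An application (web server, chat UI, ...) can forward each piece immediately.
pieces = []
for piece in stream_text(fake_generate_yield([0, 1, 2]), decode):
    pieces.append(piece)

full_text = "".join(pieces)
```

Because the generator is lazy, each piece is available to the caller before the next token is produced, which is what makes this pattern suitable for forwarding partial output to another application.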