It’s official: we’ve successfully ported vLLM’s GGUF kernel to AMD ROCm, and the performance results are remarkable. In our benchmarks, vLLM outperforms Ollama on an AMD Radeon RX 7900XTX even at a batch size of 1, the scenario where Ollama typically excels.
Performance Comparison on the ShareGPT Dataset:
Benchmark for batch size of 1:

| Framework | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) |
|---|---|---|
| vLLM (main) on RX 7900XTX | 62.66 | 134.48 |
| Ollama (0.4.6) on RX 7900XTX | 58.05 | 86.2 |
Hardware Specification:
- CPU: AMD Ryzen Threadripper 7970X 32-Cores
- GPU: AMD Radeon RX 7900XTX 24GB
Why This Matters
This breakthrough is significant for users running large language models on AMD hardware. vLLM now offers substantially improved GGUF inference on Radeon GPUs, surpassing Ollama’s single-request throughput in the benchmark above.
Getting Started
Want to try it yourself? Here’s how to set up vLLM on your AMD system:
1. Install ROCm
Follow the setup steps from ROCm on Radeon GPUs.
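Once ROCm is installed, an optional sanity check is to confirm the GPU is visible through PyTorch’s HIP backend. This assumes a ROCm build of PyTorch is available (the vLLM ROCm Docker images used below ship with one):

```python
# Optional sanity check: assumes a ROCm build of PyTorch is installed.
import torch

# On ROCm, PyTorch exposes the GPU through the torch.cuda namespace (HIP backend).
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # e.g. an RX 7900XTX
    print("HIP version:", torch.version.hip)         # None on non-ROCm builds
```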
2a. Build a Docker Image (Optional)
- Follow the setup steps from Installation with ROCm — vLLM.
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ DOCKER_BUILDKIT=1 docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm .
2b. Use the Prebuilt Docker Image
$ sudo docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v /path/to/hfmodels:/app/model \
ghcr.io/embeddedllm/vllm-rocm:navi-gguf-690c57c \
bash

Include the -v mount only if you have pre-downloaded the model weights into /path/to/hfmodels; otherwise omit that line.
3. Download the Model
- Download the model weights for Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf from bartowski/Meta-Llama-3.1-8B-Instruct-GGUF.
$ wget -O Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf "model-download-url"
# example
$ wget -O Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf"
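If you prefer the Hugging Face Hub client over wget, the same file can be fetched from Python. A small alternative sketch, assuming `huggingface_hub` is installed (`pip install huggingface_hub`):

```python
# Alternative to wget: fetch the single GGUF file via the Hugging Face Hub client.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
    local_dir=".",  # save next to where you will run `vllm serve`
)
print(path)
```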
4. Running the Model with vLLM
Launching the GGUF Model
With everything set up, you can now launch the model:
$ VLLM_RPC_TIMEOUT=30000 vllm serve ./Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --max-model-len 32768 --num-scheduler-steps 1 --served-model-name llama3.1-8b-instruct-q5_K_M
Note: To launch a GGUF model with the vLLM engine, you need to supply a repository containing a matching tokenizer. Here, we use the Llama-3.1 tokenizer from neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8.
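If you prefer offline (non-server) inference, the same GGUF file and tokenizer can also be passed to vLLM’s Python API. A minimal sketch, assuming it runs inside the ROCm container with the GGUF file in the working directory:

```python
# Minimal offline-inference sketch with vLLM's Python API (same GGUF + tokenizer as above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
    tokenizer="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",
    max_model_len=32768,
)

params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["San Francisco is a"], params)
print(outputs[0].outputs[0].text)
```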
5. Testing the Endpoint
- Make sure the model is being served correctly by testing the endpoint:
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1-8b-instruct-q5_K_M",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
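The server speaks the OpenAI API, so you can also query it from Python with the `openai` client. A small sketch, assuming `pip install openai`; the API key can be any placeholder string since the server does not check it by default:

```python
# Same request as the curl example, via the OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is not checked
resp = client.completions.create(
    model="llama3.1-8b-instruct-q5_K_M",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(resp.choices[0].text)
```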
Benchmarking vLLM
Benchmark Command
To evaluate vLLM’s performance, first download the ShareGPT dataset if you have not already (see the wget command in the appendix), then run:
cd vllm/benchmarks
python benchmark_serving.py --backend vllm --model "llama3.1-8b-instruct-q5_K_M" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 64 --max-concurrency 1 --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
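For a quick, informal single-request throughput check (not a replacement for benchmark_serving.py), you can time a few sequential completions and divide the token counts reported by the server by the elapsed time. A rough sketch, assuming the server from step 4 is still running and the `openai` client is installed; the prompts here are arbitrary examples, not ShareGPT samples:

```python
# Rough single-request throughput check against the running vLLM server.
# Numbers will not exactly match benchmark_serving.py, which uses ShareGPT prompts.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompts = [
    "San Francisco is a",
    "The capital of France is",
    "Explain GGUF quantization in one sentence:",
]

output_tokens = total_tokens = 0
start = time.perf_counter()
for p in prompts:
    resp = client.completions.create(
        model="llama3.1-8b-instruct-q5_K_M",
        prompt=p,
        max_tokens=128,
        temperature=0,
    )
    output_tokens += resp.usage.completion_tokens
    total_tokens += resp.usage.total_tokens
elapsed = time.perf_counter() - start

print(f"Output token throughput: {output_tokens / elapsed:.2f} tok/s")
print(f"Total token throughput:  {total_tokens / elapsed:.2f} tok/s")
```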
Conclusion
Feedback and Further Discussion
We’re eager to continue enhancing performance and usability. Your input is invaluable:
- More benchmarks: What other tests or benchmarks should we consider? For example, llama.cpp with its Vulkan backend.
- New features: What additional features would you like to see in vLLM?
Appendix: How to Set Up Ollama
- Installation steps can be found at Download Ollama on Linux.
Launch Model with Ollama
$ ollama run llama3.1:8b-instruct-q5_K_M
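Before benchmarking, you can confirm that Ollama’s OpenAI-compatible endpoint on port 11434 is answering for this model. A small sketch with the `openai` Python client; the API key is ignored by Ollama but the client requires one:

```python
# Quick check that Ollama's OpenAI-compatible endpoint serves the model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored
resp = client.chat.completions.create(
    model="llama3.1:8b-instruct-q5_K_M",
    messages=[{"role": "user", "content": "San Francisco is a"}],
    max_tokens=7,
    temperature=0,
)
print(resp.choices[0].message.content)
```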
Benchmark Command for Ollama
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
pip install numpy datasets Pillow tqdm transformers
python benchmark_serving.py --backend openai-chat --model "llama3.1:8b-instruct-q5_K_M" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 64 --max-concurrency 1 --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --port 11434 --endpoint /v1/chat/completions