It’s official: we’ve successfully ported vLLM’s GGUF kernel to AMD ROCm, and the performance results are remarkable. In our benchmarks, vLLM outperforms Ollama on an AMD Radeon RX 7900XTX even at a batch size of 1, the scenario where Ollama typically excels.
Performance Comparison on the ShareGPT Dataset:
Benchmark for batch size of 1:

| Framework | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) |
|---|---|---|
| vLLM (main) on RX 7900XTX | 62.66 | 134.48 |
| Ollama (0.4.6) on RX 7900XTX | 58.05 | 86.2 |
Hardware Specification:
- CPU: AMD Ryzen Threadripper 7970X 32-Cores
- GPU: AMD Radeon RX 7900XTX 24GB
Why This Matters
This breakthrough is significant for users running large language models on AMD hardware. vLLM now offers substantially improved GGUF inference on Radeon GPUs, surpassing Ollama’s single-request throughput in the benchmark above.
Getting Started
Want to try it yourself? Here’s how to set up vLLM on your AMD system:
1. Install ROCm
Follow the setup steps from ROCm on Radeon GPUs.
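Once ROCm is installed, an optional sanity check is to confirm the GPU is visible through PyTorch’s HIP backend. This assumes a ROCm build of PyTorch is available (the vLLM ROCm Docker images used below ship with one):

```python
# Optional sanity check: assumes a ROCm build of PyTorch is installed.
import torch

# On ROCm, PyTorch exposes the GPU through the torch.cuda namespace (HIP backend).
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # e.g. an RX 7900XTX
    print("HIP version:", torch.version.hip)         # None on non-ROCm builds
```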
2a. Build a Docker Image (Optional)
- Follow the setup steps from Installation with ROCm — vLLM.
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ DOCKER_BUILDKIT=1 docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm .
2b. Use the Prebuilt Docker Image
$ sudo docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v /path/to/hfmodels:/app/model \
ghcr.io/embeddedllm/vllm-rocm:navi-gguf-690c57c \
bash

Include the -v mount only if you have pre-downloaded the model weights into /path/to/hfmodels; otherwise omit that line.
3. Download the Model
- Download the model weights for Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf from bartowski/Meta-Llama-3.1-8B-Instruct-GGUF.
$ wget -O Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf "model-download-url"
# example
$ wget -O Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf"
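If you prefer the Hugging Face Hub client over wget, the same file can be fetched from Python. A small alternative sketch, assuming `huggingface_hub` is installed (`pip install huggingface_hub`):

```python
# Alternative to wget: fetch the single GGUF file via the Hugging Face Hub client.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
    local_dir=".",  # save next to where you will run `vllm serve`
)
print(path)
```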
4. Running the Model with vLLM
Launching the GGUF Model
With everything set up, you can now launch the model:
$ VLLM_RPC_TIMEOUT=30000 vllm serve ./Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --max-model-len 32768 --num-scheduler-steps 1 --served-model-name llama3.1-8b-instruct-q5_K_M
Note: To launch a GGUF model with the vLLM engine, you need to supply a repository containing a matching tokenizer. Here, we use the Llama-3.1 tokenizer from neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8.
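If you prefer offline (non-server) inference, the same GGUF file and tokenizer can also be passed to vLLM’s Python API. A minimal sketch, assuming it runs inside the ROCm container with the GGUF file in the working directory:

```python
# Minimal offline-inference sketch with vLLM's Python API (same GGUF + tokenizer as above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
    tokenizer="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",
    max_model_len=32768,
)

params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["San Francisco is a"], params)
print(outputs[0].outputs[0].text)
```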
5. Testing the Endpoint
- Make sure the model is being served correctly by testing the endpoint:
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1-8b-instruct-q5_K_M",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
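The server speaks the OpenAI API, so you can also query it from Python with the `openai` client. A small sketch, assuming `pip install openai`; the API key can be any placeholder string since the server does not check it by default:

```python
# Same request as the curl example, via the OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is not checked
resp = client.completions.create(
    model="llama3.1-8b-instruct-q5_K_M",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(resp.choices[0].text)
```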
Benchmarking vLLM
Benchmark Command
To evaluate vLLM’s performance, first download the ShareGPT dataset if you have not already (see the wget command in the appendix), then run:
cd vllm/benchmarks
python benchmark_serving.py --backend vllm --model "llama3.1-8b-instruct-q5_K_M" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 64 --max-concurrency 1 --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
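For a quick, informal single-request throughput check (not a replacement for benchmark_serving.py), you can time a few sequential completions and divide the token counts reported by the server by the elapsed time. A rough sketch, assuming the server from step 4 is still running and the `openai` client is installed; the prompts here are arbitrary examples, not ShareGPT samples:

```python
# Rough single-request throughput check against the running vLLM server.
# Numbers will not exactly match benchmark_serving.py, which uses ShareGPT prompts.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompts = [
    "San Francisco is a",
    "The capital of France is",
    "Explain GGUF quantization in one sentence:",
]

output_tokens = total_tokens = 0
start = time.perf_counter()
for p in prompts:
    resp = client.completions.create(
        model="llama3.1-8b-instruct-q5_K_M",
        prompt=p,
        max_tokens=128,
        temperature=0,
    )
    output_tokens += resp.usage.completion_tokens
    total_tokens += resp.usage.total_tokens
elapsed = time.perf_counter() - start

print(f"Output token throughput: {output_tokens / elapsed:.2f} tok/s")
print(f"Total token throughput:  {total_tokens / elapsed:.2f} tok/s")
```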
Conclusion
Feedback and Further Discussion
We’re eager to continue enhancing performance and usability. Your input is invaluable:
- More benchmarks: What other tests or benchmarks should we consider? For example, llama.cpp with its Vulkan backend.
- New features: What additional features would you like to see in vLLM?
Appendix: How to Set Up Ollama
- Installation steps can be found at Download Ollama on Linux.
Launch Model with Ollama
$ ollama run llama3.1:8b-instruct-q5_K_M
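Before benchmarking, you can confirm that Ollama’s OpenAI-compatible endpoint on port 11434 is answering for this model. A small sketch with the `openai` Python client; the API key is ignored by Ollama but the client requires one:

```python
# Quick check that Ollama's OpenAI-compatible endpoint serves the model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored
resp = client.chat.completions.create(
    model="llama3.1:8b-instruct-q5_K_M",
    messages=[{"role": "user", "content": "San Francisco is a"}],
    max_tokens=7,
    temperature=0,
)
print(resp.choices[0].message.content)
```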
Benchmark Command for Ollama
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
pip install numpy datasets Pillow tqdm transformers
python benchmark_serving.py --backend openai-chat --model "llama3.1:8b-instruct-q5_K_M" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 64 --max-concurrency 1 --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --port 11434 --endpoint /v1/chat/completions