vLLM Now Supports Running GGUF on AMD Radeon GPU

Dec 1, 2024

2 min read


By EmbeddedLLM Team

It’s official: we’ve successfully ported vLLM’s GGUF kernel to AMD ROCm, and the performance results are remarkable. In our benchmarks, vLLM outperforms Ollama on an AMD Radeon RX 7900XTX even at a batch size of 1, the regime where Ollama typically excels.

Performance Comparison on the ShareGPT Dataset:


Benchmark for batch size of 1:

| Framework                     | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) |
|-------------------------------|---------------------------------|--------------------------------|
| vLLM (main) on RX 7900XTX     | 62.66                           | 134.48                         |
| Ollama (0.4.6) on RX 7900XTX  | 58.05                           | 86.2                           |

Hardware Specification:

  • CPU: AMD Ryzen Threadripper 7970X 32-Cores
  • GPU: AMD Radeon RX 7900XTX 24GB

Why This Matters

This is significant for anyone running large language models on AMD hardware: vLLM now provides a fast GGUF inference path on Radeon GPUs, and in the single-request benchmark above it delivers higher output and total token throughput than Ollama.

Getting Started

Want to try it yourself? Here’s how to set up vLLM on your AMD system:

1. Install ROCm

Follow the setup steps from ROCm on Radeon GPUs.
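
To confirm the installation, you can check that ROCm detects the Radeon GPU (an optional sanity check; rocm-smi and rocminfo ship with a standard ROCm install):

$ rocm-smi
$ rocminfo | grep gfx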

2a. Build a Docker Image (Optional)

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ DOCKER_BUILDKIT=1 docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm .
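
If you build the image yourself, the run command in step 2b applies unchanged; simply substitute your local vllm-rocm tag for the prebuilt ghcr.io image.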

2b. Use the Prebuilt Docker Image

$ sudo docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   -v /path/to/hfmodels:/app/model \
   ghcr.io/embeddedllm/vllm-rocm:navi-gguf-690c57c \
   bash

Note: The -v /path/to/hfmodels:/app/model mount is only needed if you have pre-downloaded the model weights; omit it otherwise.
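
Once inside the container, you can optionally verify that vLLM imports and that the ROCm build of PyTorch sees the GPU (a quick sanity check, assuming the image ships both):

$ python -c "import vllm; print(vllm.__version__)"
$ python -c "import torch; print(torch.cuda.is_available())"

On ROCm builds, torch.cuda.is_available() reports the HIP device, so it should print True.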

3. Download the Model

$ wget -O Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf "model-download-url"

# example
$ wget -O Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf"

4. Running the Model with vLLM

Launching the GGUF Model

With everything set up, you can now launch the model:

$ VLLM_RPC_TIMEOUT=30000 vllm serve ./Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --max_model_len 32768 --num-scheduler-steps 1 --served_model_name llama3.1-8b-instruct-q5_K_M

Note: To launch a GGUF model with the vLLM engine, you need to supply a Hugging Face repository containing the matching tokenizer. Here, we use the Llama 3.1 tokenizer from neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8.
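
Once the server is up, you can also confirm the served model name through the OpenAI-compatible models endpoint (assuming the default port 8000):

$ curl http://localhost:8000/v1/models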

5. Testing the Endpoint

  • Ensure the model is being served by testing the endpoint:
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1-8b-instruct-q5_K_M",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'

Benchmarking vLLM

Benchmark Command

To evaluate vLLM’s performance:

cd vllm/benchmarks
python benchmark_serving.py --backend vllm --model "llama3.1-8b-instruct-q5_K_M" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 64 --max-concurrency 1 --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
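
The benchmark script also expects the ShareGPT dataset and a few Python packages to be available locally. If you have not set them up yet, the same commands shown in the appendix apply:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
pip install numpy datasets Pillow tqdm transformers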

Conclusion

Feedback and Further Discussion

We’re eager to continue enhancing performance and usability. Your input is invaluable:

  • More benchmarks: What other tests or benchmarks should we consider? For example, llama.cpp with the Vulkan backend.
  • New features: What additional features would you like to see in vLLM?

Appendix: How to Setup Ollama

Launch Model with Ollama

$ ollama run llama3.1:8b-instruct-q5_K_M
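
Ollama serves its API on port 11434 by default, which is the port the benchmark command below targets. You can optionally confirm the model is available locally (assuming a default Ollama install):

$ curl http://localhost:11434/api/tags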

Benchmark Command for Ollama

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
pip install numpy datasets Pillow tqdm transformers

python benchmark_serving.py --backend openai-chat --model "llama3.1:8b-instruct-q5_K_M" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 64 --max-concurrency 1 --tokenizer neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --port 11434 --endpoint /v1/chat/completions
