This blog post shows you how to run Meta’s powerful Llama 3.2-90B-Vision-Instruct model on an AMD MI300X GPU using vLLM. We provide the Docker commands, code snippets, and a video demo to help you get started with image-based prompts and experience impressive performance.
Receipt Analysis Demo
Extract structured data from receipts with ease! This video demonstrates Llama-3.2-90B-Vision-Instruct running on 8x AMD MI300X, hosted using vLLM. JamAI Base sends 20 concurrent prompts, showcasing the model’s ability to analyze receipts and extract key information like shop name, payment method, and totals.
Introduction
The release of Meta’s Llama 3.2 Vision has unlocked incredible potential for multimodal AI applications. Imagine AI that can not only understand text but also “see” and interpret images, enabling it to answer questions about visual content, generate detailed image captions, and even analyze complex documents with charts and diagrams. This is the power of Llama 3.2 Vision, and it’s now accessible to a wider audience thanks to its open-source nature and compatibility with powerful hardware like the AMD MI300X GPUs.
This video demonstrates Llama-3.2-90B-Vision-Instruct running on 8x AMD MI300X, hosted using vLLM via the ROCm/vllm fork. JamAI Base sends 16 concurrent requests to the vLLM server, showcasing the model’s ability to extract information about characters from images.
Llama 3.2 Vision: A Brief Overview
Llama 3.2 marks a significant leap forward in AI by introducing multimodal capabilities to the Llama family. This means the model can process and understand both text and images, opening up a whole new world of applications.
At the heart of Llama 3.2 Vision lies a novel architecture that seamlessly integrates an image encoder with the language model. This image encoder transforms images into a format that the language model can understand, allowing it to reason over visual information. Think of it as giving the language model “eyes” to perceive the world.
This architecture enables Llama 3.2 Vision to tackle a wide range of tasks, including:
- Image Captioning: Generating descriptive captions for images
- Visual Question Answering: Answering questions about the content of images
- Object Detection: Identifying and locating objects within images
- Optical Character Recognition (OCR): Extracting text from images
- Data Extraction: Pulling out key information from visual documents
Llama 3.2 Vision comes in two sizes:
- 11B parameters: A smaller, more efficient model suitable for tasks with moderate complexity
- 90B parameters: A larger, more powerful model capable of handling complex visual reasoning tasks
Both models support high-resolution images up to 1120x1120 pixels, allowing for detailed analysis of visual information.
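If you plan to feed the model large photos or document scans, it can help to downscale them client-side before sending a request; keeping the longest side within the supported resolution keeps payloads small without throwing away detail the model can actually use. Here is a minimal sketch, assuming Pillow is installed (the function name and file paths are placeholders):
from PIL import Image

def downscale_for_llama(path, out_path, max_side=1120):
    # Shrink the image so neither side exceeds max_side; thumbnail()
    # preserves aspect ratio and never upscales smaller images
    img = Image.open(path)
    img.thumbnail((max_side, max_side))
    img.save(out_path)

downscale_for_llama("input.jpg", "input_resized.jpg")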
AMD MI300X: The Hardware Powerhouse
Cutting-edge AI like Llama 3.2 Vision demands powerful hardware. Enter the AMD Instinct MI300X, a GPU purpose-built for high-performance computing and AI. It boasts impressive specs that make it ideal for large language models.
LLMs need vast memory capacity and bandwidth. The MI300X delivers with a massive 192 GB of HBM3 memory, more than double the Nvidia H100 SXM’s 80 GB. This allows it to hold Llama 3.2 Vision’s numerous parameters and store activations efficiently. Furthermore, the MI300X offers 5.3 TB/s of memory bandwidth, enabling rapid data transfer and minimizing latency for maximum performance.
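A quick back-of-the-envelope calculation shows why that capacity matters. The numbers below are rough estimates for illustration only; they count weights in 16-bit precision and ignore activations and KV cache, which add a significant amount on top:
# Rough memory estimate for serving Llama 3.2 Vision 90B in 16-bit precision
params_billion = 90
bytes_per_param = 2  # bf16 / fp16
weights_gb = params_billion * bytes_per_param  # ~180 GB for the weights alone

h100_sxm_gb = 80
mi300x_gb = 192

print(f"Weights: ~{weights_gb} GB")
print(f"H100 SXM (80 GB) cards needed for weights alone: {-(-weights_gb // h100_sxm_gb)}")  # 3
print(f"MI300X (192 GB) cards needed for weights alone: {-(-weights_gb // mi300x_gb)}")     # 1
By this rough count, the 90B model’s weights alone would not fit on a single 80 GB card, while a single MI300X can hold them, and an 8-GPU node leaves ample headroom for the KV cache.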
Why MI300X for Llama 3.2 Vision?
The MI300X’s combination of high memory capacity, impressive bandwidth, and powerful compute makes it a perfect choice for running Llama 3.2 Vision. It provides the resources needed to handle the model’s complexity and deliver optimal performance for demanding visual AI tasks.
vLLM: Efficient Inference with ROCm
Serving large language models like Llama 3.2 Vision efficiently can be challenging. This is where vLLM comes in. vLLM is a powerful open-source library designed to optimize LLM inference, making it faster and more scalable.
ROCm Support for AMD GPUs
Thanks to the AMD vLLM team, the ROCm/vLLM fork now includes experimental cross-attention kernel support, which is crucial for running Llama 3.2 Vision on AMD MI300X GPUs. While support for Llama 3.2 Vision is still experimental due to the complexities of cross-attention, active development is underway to fully integrate it into the main vLLM project.
Benefits of vLLM for Llama 3.2 Vision
- Increased Throughput: vLLM can handle many concurrent users in parallel, making it ideal for serving Llama 3.2 Vision in demanding applications
- Reduced Latency: Continuous batching and paged attention minimize processing time, resulting in faster responses
- Simplified Deployment: The OpenAI-compatible API and built-in image preprocessing streamline the deployment of Llama 3.2 Vision (see the streaming sketch right after this list)
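As a taste of what the OpenAI-compatible API looks like in practice, here is a minimal streaming sketch. It assumes a server like the one set up in the next section is already running on localhost:8000; the model name and prompt are just placeholders:
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream tokens as they are generated instead of waiting for the full answer
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[{"role": "user", "content": "Summarize what Llama 3.2 Vision can do in one sentence."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()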
Get Started
Ready to experience the power of Llama 3.2 Vision on AMD MI300X? Here’s how to set up your environment and start sending image requests:
1. Launch the Docker Image
sudo docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--shm-size=8g \
--device /dev/kfd \
--device /dev/dri \
-e NCCL_MIN_NCHANNELS=112 \
--ulimit memlock=2147483648 \
--ulimit stack=2147483648 \
ghcr.io/embeddedllm/vllm-amdfork:968345a-mllamafix \
bash
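Before moving on, it is worth confirming that all eight GPUs are visible inside the container. One quick way is with the PyTorch ROCm build bundled in the vLLM image (that the image ships with PyTorch is an assumption; on ROCm builds the torch.cuda API maps to AMD GPUs):
import torch

# On ROCm builds of PyTorch, the torch.cuda API maps to AMD GPUs
print(torch.cuda.is_available())      # expect: True
print(torch.cuda.device_count())      # expect: 8 on an 8x MI300X node
print(torch.cuda.get_device_name(0))  # expect something like "AMD Instinct MI300X"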
2. Launch the Model Instance
Choose between the 11B or 90B model:
For 11B model:
HF_TOKEN=<Your-HF-Token> VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --host 0.0.0.0 --port 8000 -tp 8 --served-model-name meta-llama/Llama-3.2-11B-Vision-Instruct
For 90B model:
HF_TOKEN=<Your-HF-Token> VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct --host 0.0.0.0 --port 8000 -tp 8 --served-model-name meta-llama/Llama-3.2-90B-Vision-Instruct
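Once vLLM logs that the server is up, you can confirm the model is registered by listing the models exposed through the OpenAI-compatible endpoint (a minimal sketch; adjust the host and port if you changed them):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Should print the --served-model-name passed to `vllm serve`
for model in client.models.list():
    print(model.id)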
3. Send Image Requests using Image URL
curl -X POST -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Transcribe the image into a HTML, be very complete and do not skip any words you see."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://th.bing.com/th/id/OIP.EGNstT4uilQv7zcc_kIEBAHaSC?rs=1&pid=ImgDetMain"
          }
        }
      ]
    }
  ]
}' 'http://0.0.0.0:8000/v1/chat/completions'
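If you prefer Python over curl, the same request can be sent with the requests library, which also makes it easy to pull the generated text out of the JSON response. A sketch assuming requests is installed, reusing the prompt and image URL from the curl example:
import requests

payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the image into a HTML, be very complete and do not skip any words you see."},
                {"type": "image_url", "image_url": {"url": "https://th.bing.com/th/id/OIP.EGNstT4uilQv7zcc_kIEBAHaSC?rs=1&pid=ImgDetMain"}},
            ],
        }
    ],
}

# The OpenAI-compatible endpoint returns a standard chat-completion JSON body
response = requests.post("http://0.0.0.0:8000/v1/chat/completions", json=payload, timeout=300)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])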
4. Send Image Requests using Base64-Encoded Image
First, download a sample image:
wget https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
Here’s the Python script to send requests:
from openai import OpenAI
import base64

# Initialize the client with your local server's URL
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # if required by your local server
)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "./2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

# Preparing the messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in the image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }
]

# Making the API call
try:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-90B-Vision-Instruct",
        messages=messages,
        max_tokens=300,
    )
    # Print the response
    print(response.choices[0].message.content)
except Exception as e:
    print(f"An error occurred: {e}")
Conclusion
Llama 3.2 Vision and AMD MI300X GPUs bring powerful multimodal AI capabilities within reach. This combination excels in image understanding, question answering, and document analysis. This is just the beginning of visual AI’s potential. Explore Llama 3.2 Vision on AMD MI300X and be a part of this exciting future!
Acknowledgement
We would like to thank Hot Aisle Inc. for sponsoring the AMD MI300X GPUs used in this project. If you’re interested in running your own LLMs at “The Switch Pyramid,” a visually stunning, top-tier data center in Michigan, be sure to contact Hot Aisle Inc.