See the Power of Llama 3.2 Vision on AMD MI300X

Oct 28, 2024

5 mins read

By EmbeddedLLM Team

This blog post shows you how to run Meta’s powerful Llama 3.2-90B-Vision-Instruct model on an AMD MI300X GPU using vLLM. We provide the Docker commands, code snippets, and a video demo to help you get started with image-based prompts and experience impressive performance.

Receipt Analysis Demo

Extract structured data from receipts with ease! This video demonstrates Llama-3.2-90B-Vision-Instruct running on 8x AMD MI300X, hosted using vLLM. JamAI Base sends 20 concurrent prompts, showcasing the model’s ability to analyze receipts and extract key information like shop name, payment method, and totals.

Introduction

The release of Meta’s Llama 3.2 Vision has unlocked incredible potential for multimodal AI applications. Imagine AI that can not only understand text but also “see” and interpret images, enabling it to answer questions about visual content, generate detailed image captions, and even analyze complex documents with charts and diagrams. This is the power of Llama 3.2 Vision, and it’s now accessible to a wider audience thanks to its open-source nature and compatibility with powerful hardware like the AMD MI300X GPUs.

A second video demonstrates Llama-3.2-90B-Vision-Instruct running on 8x AMD MI300X, hosted using vLLM via the ROCm/vllm fork. JamAI Base sends 16 concurrent requests to the vLLM server, showcasing the model’s ability to extract information about characters from images.

Llama 3.2 Vision: A Brief Overview

Llama 3.2 marks a significant leap forward in AI by introducing multimodal capabilities to the Llama family. This means the model can process and understand both text and images, opening up a whole new world of applications.

At the heart of Llama 3.2 Vision lies a novel architecture that seamlessly integrates an image encoder with the language model. This image encoder transforms images into a format that the language model can understand, allowing it to reason over visual information. Think of it as giving the language model “eyes” to perceive the world.
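
For intuition only, here is a minimal NumPy sketch of the cross-attention idea (our own illustration, not the actual Llama 3.2 implementation): queries come from the text stream, while keys and values come from the vision encoder’s image tokens.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes: 4 text tokens, 6 image patches, hidden size 8
d = 8
text_hidden = np.random.randn(4, d)    # language model hidden states
image_tokens = np.random.randn(6, d)   # vision encoder outputs

# Random matrices stand in for learned projection weights
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q = text_hidden @ Wq                   # queries from the text stream
K = image_tokens @ Wk                  # keys from the image stream
V = image_tokens @ Wv                  # values from the image stream

attn = softmax(Q @ K.T / np.sqrt(d))   # each text token attends over image patches
fused = attn @ V                       # visual information injected into text states
print(fused.shape)                     # (4, 8)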

This architecture enables Llama 3.2 Vision to tackle a wide range of tasks, including:

  • Image Captioning: Generating descriptive captions for images
  • Visual Question Answering: Answering questions about the content of images
  • Object Detection: Identifying and locating objects within images
  • Optical Character Recognition (OCR): Extracting text from images
  • Data Extraction: Pulling out key information from visual documents

Llama 3.2 Vision comes in two sizes:

  • 11B parameters: A smaller, more efficient model suitable for tasks with moderate complexity
  • 90B parameters: A larger, more powerful model capable of handling complex visual reasoning tasks

Both models support high-resolution images up to 1120x1120 pixels, allowing for detailed analysis of visual information.
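
If your source images are much larger than this, you can optionally downscale them client-side before encoding. Here is a simple Pillow-based sketch (a helper of our own with a placeholder file name, not part of the model tooling):

from PIL import Image

def fit_within(path, max_side=1120):
    """Downscale so neither side exceeds max_side, preserving aspect ratio."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in-place resize, keeps aspect ratio
    return img

resized = fit_within("example.jpg")  # placeholder path
print(resized.size)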

AMD MI300X: The Hardware Powerhouse

Cutting-edge AI like Llama 3.2 Vision demands powerful hardware. Enter the AMD Instinct MI300X, a GPU purpose-built for high-performance computing and AI. It boasts impressive specs that make it ideal for large language models.

LLMs need vast memory capacity and bandwidth. The MI300X delivers with a massive 192GB of HBM memory, dwarfing the Nvidia H100 SXM’s 80GB. This allows it to handle Llama 3.2 Vision’s numerous parameters and store activations efficiently. Furthermore, the MI300X offers 5.3 TB/s of memory bandwidth, enabling rapid data transfer and minimizing latency for maximum performance.
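
As a rough back-of-the-envelope sketch (our own estimate, not an official sizing guide), the snippet below shows why the 90B model in 16-bit precision fits comfortably across eight MI300X GPUs, with plenty of HBM left over for the KV cache:

# Rough memory estimate for serving Llama-3.2-90B-Vision-Instruct in bf16
params = 90e9                 # approximate parameter count
bytes_per_param = 2           # bf16 / fp16

weight_gb = params * bytes_per_param / 1e9    # ~180 GB of weights
total_hbm_gb = 8 * 192                        # eight MI300X GPUs
headroom_gb = total_hbm_gb - weight_gb        # left for KV cache and activations

print(f"Weights ~{weight_gb:.0f} GB, HBM {total_hbm_gb} GB, headroom ~{headroom_gb:.0f} GB")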

Why MI300X for Llama 3.2 Vision?

The MI300X’s combination of high memory capacity, impressive bandwidth, and powerful compute makes it a perfect choice for running Llama 3.2 Vision. It provides the resources needed to handle the model’s complexity and deliver optimal performance for demanding visual AI tasks.

vLLM: Efficient Inference with ROCm

Serving large language models like Llama 3.2 Vision efficiently can be challenging. This is where vLLM comes in. vLLM is a powerful open-source library designed to optimize LLM inference, making it faster and more scalable.

ROCm Support for AMD GPUs

Thanks to the AMD vLLM team, the ROCm/vllm fork now includes experimental cross-attention kernel support, which is crucial for running Llama 3.2 Vision on AMD MI300X GPUs. While support for Llama 3.2 Vision is still experimental due to the complexities of cross-attention, active development is underway to fully integrate it into the main vLLM project.

Benefits of vLLM for Llama 3.2 Vision

  • Increased Throughput: vLLM can handle many concurrent users in parallel, making it ideal for serving Llama 3.2 Vision in demanding applications
  • Reduced Latency: Continuous batching and paged attention minimize processing time, resulting in faster responses
  • Simplified Deployment: The OpenAI-compatible API and built-in image preprocessing streamline the deployment of Llama 3.2 Vision

Get Started

Ready to experience the power of Llama 3.2 Vision on AMD MI300X? Here’s how to set up your environment and start sending image requests:

1. Launch the Docker Image

sudo docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --shm-size=8g \
   --device /dev/kfd \
   --device /dev/dri \
   -e NCCL_MIN_NCHANNELS=112 \
   --ulimit memlock=2147483648 \
   --ulimit stack=2147483648 \
   ghcr.io/embeddedllm/vllm-amdfork:968345a-mllamafix \
   bash
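
Once inside the container, you can optionally sanity-check that all eight GPUs are visible before starting the server. One quick way (assuming the container’s ROCm build of PyTorch, where the GPUs show up through the usual torch.cuda API) is:

import torch

# On ROCm builds of PyTorch, the torch.cuda API maps to HIP devices
print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))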

2. Launch the Model Instance

Choose between the 11B or 90B model:

For 11B model:

HF_TOKEN=<Your-HF-Token> VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --host 0.0.0.0 --port 8000 -tp 8 --served-model-name meta-llama/Llama-3.2-11B-Vision-Instruct

For 90B model:

HF_TOKEN=<Your-HF-Token> VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct --host 0.0.0.0 --port 8000 -tp 8 --served-model-name meta-llama/Llama-3.2-90B-Vision-Instruct
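
The first launch can take a while, as the weights are downloaded and sharded across the eight GPUs. To confirm the OpenAI-compatible endpoint is ready (a generic check, not specific to this model), you can list the served models:

import requests

# vLLM exposes the OpenAI-compatible /v1/models endpoint once loading finishes
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
print(resp.json())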

3. Send Image Requests using Image URL

curl -XPOST -H "Content-type: application/json" -d '{
  "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [{
                "type": "text",
                "text": "Transcribe the image into a HTML, be very complete and do not skip any words you see."},{
            "type": "image_url",
            "image_url": {
                "url": "https://th.bing.com/th/id/OIP.EGNstT4uilQv7zcc_kIEBAHaSC?rs=1&pid=ImgDetMain"
            }
        }]
    }
  ]
}' 'http://0.0.0.0:8000/v1/chat/completions'

4. Send Image Requests using Base64-Encoded Image

First, download a sample image:

wget https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg

Here’s the Python script to send requests:

from openai import OpenAI
import base64

# Initialize the client with your local server's URL
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"  # if required by your local server
)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "./2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

# Preparing the messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in the image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }
]

# Making the API call
try:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-90B-Vision-Instruct",
        messages=messages,
        max_tokens=300
    )

    # Print the response
    print(response.choices[0].message.content)
except Exception as e:
    print(f"An error occurred: {e}")

Conclusion

Llama 3.2 Vision and AMD MI300X GPUs bring powerful multimodal AI capabilities within reach. This combination excels in image understanding, question answering, and document analysis. This is just the beginning of visual AI’s potential. Explore Llama 3.2 Vision on AMD MI300X and be a part of this exciting future!

Acknowledgement

We would like to thank Hot Aisle Inc. for sponsoring the AMD MI300X GPUs used in this project. If you’re interested in running your own LLM models in the visually stunning “Switch Pyramid”, a top-tier data center in Michigan, be sure to contact Hot Aisle Inc.

