How to Build vLLM on MI300X from Source

Oct 11, 2024

8 mins read


By the EmbeddedLLM Team

This guide walks you through the process of building vLLM from source on AMD MI300X. The build has been verified for ROCm 6.2.

Prerequisites

  • CMake version > 3.26.0

    • Download the CMake .sh installer from the CMake download page (https://cmake.org/download/)
    • Example (MI300X hosts are x86_64, as shown in the environment section below):
      wget https://github.com/Kitware/CMake/releases/download/v3.30.5/cmake-3.30.5-linux-x86_64.sh
      bash cmake-3.30.5-linux-x86_64.sh
      
    • Add to PATH:
      1. Open ~/.bashrc in a text editor
      2. Add at the end: export PATH="/path/to/cmake/bin:$PATH"
      3. Replace /path/to/cmake with the actual install path
    • Verify installation: cmake --version (ensure the version is > 3.26.0)
  • hipBLASLt: sudo apt install hipblaslt

  • RCCL: sudo apt install rccl

  • (Recommended) Anaconda or a python-venv environment (a quick sanity-check for these prerequisites follows this list)
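
A minimal way to sanity-check the prerequisites above (package names follow the apt commands in this list):

cmake --version                      # must report a version > 3.26.0
dpkg -l | grep -E 'hipblaslt|rccl'   # confirm hipBLASLt and RCCL are installed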

Building vLLM from source

Once the basic prerequisites are set up, you can follow the block of commands below to build vLLM from source.

There are a few things going on in the block of commands:

  1. Installation of a newer PyTorch (2.6.0-dev nightly)
  2. Installation of CK Flash Attention
  3. Installation of amd_smi package
  4. Installation of vLLM from source

We recommend isolating the Python environment using venv or Anaconda before running the following commands.
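
For example, with the built-in venv module (the path below is just an example; conda works equally well):

python3 -m venv ~/venvs/vllm-rocm
source ~/venvs/vllm-rocm/bin/activate
python3 -m pip install --upgrade pip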

Sources:

The commands below are adapted from two upstream Dockerfiles, cited inline as Source (A) and Source (B). In the future, you may want to learn how to build vLLM from source by examining such Dockerfiles directly, since they serve as a form of documentation.

The following build is verified for ROCm 6.2 only.
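
You can confirm the ROCm release on your host before starting; the amd-smi output later in this post reports ROCm 6.2.2:

amd-smi version
# or, if the file is present on your install:
cat /opt/rocm/.info/version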

## Source (A)
# export PYTORCH_ROCM_ARCH="gfx908;gfx90a;gfx942;gfx1100"
# To save time, enable only the arch you need, e.g. on MI300X:
export PYTORCH_ROCM_ARCH="gfx942"

# export FA_GFX_ARCHS="gfx90a;gfx942"
# To save time, enable only the arch you need, e.g. on MI300X:
export FA_GFX_ARCHS="gfx942"

export FA_BRANCH="3cea2fb"

## Increasing the open-file limit resolves the following two issues
### Issue 1: Resolves https://github.com/pytorch/pytorch/issues/137695
# rocblaslt error: Could not load /home/hotaisle/anaconda3/envs/vllm062rocm622/lib/python3.9/site-packages/torch/lib/hipblaslt/library/TensileLibrary_lazy_gfx942.dat
### Issue 2: Depending on your system configuration, you will generally also want a higher open-file limit when serving many concurrent HTTP requests (e.g. 1000)
# Run the ulimit command before you launch the vLLM instance
ulimit -n 131072

# Install a pre-released version
python3 -m pip install --pre \
                torch==2.6.0.dev20240918 \
                "setuptools-scm>=8" \
                torchvision==0.20.0.dev20240918 \
                --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
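
## (Optional) Sanity-check that the nightly ROCm wheel loads and sees the GPUs.
## On ROCm builds of PyTorch, torch.version.hip is set and the torch.cuda API maps to HIP devices.
python3 -c 'import torch; print(torch.__version__, torch.version.hip, torch.cuda.device_count())'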

git clone https://github.com/vllm-project/vllm
cd vllm
# Optional: if you find upstream broken and would like to install vLLM v0.6.2:
# git checkout cb3b2b9ba4a95c413a879e30e2b8674187519a93
python3 -m pip install -Ur requirements-rocm.txt

## Install CK Flash Attention (recommended); it is a lot faster than the Triton flash attention backend
cd ../
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
git checkout "${FA_BRANCH}"
git submodule update --init
# To speed up the compilation
python3 -m pip install ninja
GPU_ARCHS="${FA_GFX_ARCHS}" python3 setup.py install
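
## (Optional) Verify that the CK flash-attention build imports cleanly.
## The ROCm fork installs the same flash_attn module name as upstream.
python3 -c 'import flash_attn; print("flash_attn OK")'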
cd ../

## Install the amdsmi Python package from the local ROCm tree
cp -r /opt/rocm/share/amd_smi ./
cd amd_smi
sudo rm -rf amdsmi.egg-info/
python3 -m pip install .
cd ../
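
## (Optional) Verify that the amdsmi Python bindings import cleanly.
python3 -c 'import amdsmi; print("amdsmi OK")'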

## Source (B)
conda install ccache
# or: apt install ccache
# The amd_smi step above left us in the parent directory, so return to the vLLM checkout
cd vllm
python3 setup.py develop
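
## (Optional) Confirm the editable vLLM install is importable.
python3 -c 'import vllm; print(vllm.__version__)'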

## Some AMD-specific optimizations
# Writing to /proc/sys requires root
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
export NCCL_MIN_NCHANNELS=112

## Serve your model
# To use the CK Flash Attention backend:
export VLLM_USE_TRITON_FLASH_ATTN=0
# To use the Triton backend instead:
# export VLLM_USE_TRITON_FLASH_ATTN=1

# Note: the original command passed --max-num-seqs twice (1024, then 512); the last value wins, so 512 is kept here
vllm serve meta-llama/Llama-3.1-70B-Instruct --host 0.0.0.0 --port 8000 -tp 4 --max-num-seqs 512 --max-seq-len-to-capture 16384 --served-model-name meta-llama/Meta-Llama-3.1-70B-Instruct --enable-chunked-prefill=False --num-scheduler-steps 15
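
Once the server is up, you can smoke-test the OpenAI-compatible endpoint with curl (a minimal sketch; adjust the host, port, and model name to match your serve flags):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Say hello from MI300X."}], "max_tokens": 32}'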

How we derived these steps

How did we come up with all these steps, and how do we make sure we can always build vLLM from source even when the documentation is not up to date? Search for Dockerfiles.

Question to Claude: Why do we need a Dockerfile? What are the benefits of Dockerfile?

Response from Claude: Dockerfiles are essential in containerization for several reasons:

    1. Reproducibility: Dockerfiles provide a consistent way to build container images, ensuring that the same environment can be recreated across different systems.
    2. Automation: They automate the process of creating container images, reducing manual steps and potential human errors.

"Reproducibility." A Dockerfile thus contains a lot of vital information about the installation steps.

As you saw from the citations in the command block above, the steps are all extracted from Source (A) and Source (B).

So, in the future, you might want to figure out how to build vLLM from source by reading through the content of the relevant Dockerfile; it is documentation in disguise.
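
For example, at the time of writing, the vLLM repository ships a ROCm Dockerfile that you can read like a recipe:

# Inside the cloned vLLM repository
less Dockerfile.rocm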

Additional Notes

If you encounter the following error on the MI300X platform:

INFO 10-14 09:18:43 selector.py:121] Using ROCmFlashAttention backend.
Loading safetensors checkpoint shards:   0% Completed | 0/109 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 109/109 [00:00<00:00, 2575.95it/s]

(VllmWorkerProcess pid=2475850) INFO 10-14 09:19:21 model_runner.py:1060] Loading model weights took 56.7677 GB
(VllmWorkerProcess pid=2475856) INFO 10-14 09:19:21 model_runner.py:1060] Loading model weights took 56.7677 GB
(VllmWorkerProcess pid=2475852) INFO 10-14 09:19:21 model_runner.py:1060] Loading model weights took 56.7677 GB
(VllmWorkerProcess pid=2475854) INFO 10-14 09:19:22 model_runner.py:1060] Loading model weights took 56.7677 GB
(VllmWorkerProcess pid=2475851) INFO 10-14 09:19:22 model_runner.py:1060] Loading model weights took 56.7677 GB
(VllmWorkerProcess pid=2475849) INFO 10-14 09:19:22 model_runner.py:1060] Loading model weights took 56.7677 GB
(VllmWorkerProcess pid=2475853) INFO 10-14 09:19:22 model_runner.py:1060] Loading model weights took 56.7677 GB
INFO 10-14 09:19:22 model_runner.py:1060] Loading model weights took 56.7677 GB

rocblaslt error: Could not load /home/hotaisle/anaconda3/envs/vllm062rocm622/lib/python3.9/site-packages/torch/lib/hipblaslt/library/TensileLibrary_lazy_gfx942.dat
(the same rocblaslt error is repeated once per worker process)

You can try increasing the open file handle count: ulimit -n 131072
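
To inspect the current limit and make the increase persist across sessions (a sketch; the limits.conf entries are standard Linux, and the values are examples):

ulimit -n         # show the current open-file limit for this shell
ulimit -n 131072  # raise it for the current shell only
# To persist, add the following to /etc/security/limits.conf (requires root):
#   * soft nofile 131072
#   * hard nofile 131072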

If that does not work, check the related upstream resources and issues (for example, pytorch/pytorch#137695, referenced above).

Additional Information About Our Environment

We have tried our best to list all the relevant environment details and packages on our system for ease of debugging.

amd-smi
$ amd-smi version
/opt/rocm-6.2.2/libexec/amdsmi_cli/BDF.py:126: SyntaxWarning: invalid escape sequence '\.'
  bdf_regex = "(?:[0-6]?[0-9a-fA-F]{1,4}:)?[0-2]?[0-9a-fA-F]{1,2}:[0-9a-fA-F]{1,2}\.[0-7]"
AMDSMI Tool: 24.6.3+52b3947 | AMDSMI Library version: 24.6.3.0 | ROCm version: 6.2.2
apt packages
$ apt list --installed | grep -E 'hip|rocm|roc|amdgpu'

amd64-microcode/jammy-updates,jammy-security,now 3.20191218.1ubuntu2.2 amd64 [installed,automatic]
amdgpu-core/jammy,now 1:6.2.60202-2041575.22.04 all [installed,automatic]
amdgpu-dkms-firmware/jammy,now 1:6.8.5.60202-2041575.22.04 all [installed,automatic]
amdgpu-dkms/jammy,now 1:6.8.5.60202-2041575.22.04 all [installed]
amdgpu-install/jammy,now 6.2.60202-2041575.22.04 all [installed]
hip-dev/jammy,now 6.2.41134.60202-116~22.04 amd64 [installed,auto-removable]
hip-doc/jammy,now 6.2.41134.60202-116~22.04 amd64 [installed,auto-removable]
hip-runtime-amd/jammy,now 6.2.41134.60202-116~22.04 amd64 [installed,automatic]
hip-samples/jammy,now 6.2.41134.60202-116~22.04 amd64 [installed,auto-removable]
hipblas-dev/jammy,now 2.2.0.60202-116~22.04 amd64 [installed,automatic]
hipblas/jammy,now 2.2.0.60202-116~22.04 amd64 [installed,automatic]
hipblaslt-dev/jammy,now 0.8.0.60202-116~22.04 amd64 [installed,automatic]
hipblaslt/jammy,now 0.8.0.60202-116~22.04 amd64 [installed]
hipcc/jammy,now 1.1.1.60202-116~22.04 amd64 [installed,auto-removable]
hipcub-dev/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,auto-removable]
hipfft-dev/jammy,now 1.0.15.60202-116~22.04 amd64 [installed,auto-removable]
hipfft/jammy,now 1.0.15.60202-116~22.04 amd64 [installed,auto-removable]
hipfort-dev/jammy,now 0.4.0.60202-116~22.04 amd64 [installed,auto-removable]
hipify-clang/jammy,now 18.0.0.60202-116~22.04 amd64 [installed,auto-removable]
hiprand-dev/jammy,now 2.11.0.60202-116~22.04 amd64 [installed,auto-removable]
hiprand/jammy,now 2.11.0.60202-116~22.04 amd64 [installed,auto-removable]
hipsolver-dev/jammy,now 2.2.0.60202-116~22.04 amd64 [installed,auto-removable]
hipsolver/jammy,now 2.2.0.60202-116~22.04 amd64 [installed,auto-removable]
hipsparse-dev/jammy,now 3.1.1.60202-116~22.04 amd64 [installed,auto-removable]
hipsparse/jammy,now 3.1.1.60202-116~22.04 amd64 [installed,auto-removable]
hipsparselt-dev/jammy,now 0.2.1.60202-116~22.04 amd64 [installed,auto-removable]
hipsparselt/jammy,now 0.2.1.60202-116~22.04 amd64 [installed,auto-removable]
hiptensor-dev/jammy,now 1.3.0.60202-116~22.04 amd64 [installed,auto-removable]
hiptensor/jammy,now 1.3.0.60202-116~22.04 amd64 [installed,auto-removable]
hsa-rocr-dev/jammy,now 1.14.0.60202-116~22.04 amd64 [installed,auto-removable]
hsa-rocr/jammy,now 1.14.0.60202-116~22.04 amd64 [installed,automatic]
hsakmt-roct-dev/jammy,now 20240607.4.05.60202-116~22.04 amd64 [installed,auto-removable]
intel-microcode/jammy-updates,jammy-security,now 3.20240910.0ubuntu0.22.04.1 amd64 [installed,automatic]
libdrm-amdgpu-amdgpu1/jammy,now 1:2.4.120.60202-2041575.22.04 amd64 [installed,automatic]
libdrm-amdgpu-common/jammy,now 1.0.0.60202-2041575.22.04 all [installed,automatic]
libdrm-amdgpu-dev/jammy,now 1:2.4.120.60202-2041575.22.04 amd64 [installed,auto-removable]
libdrm-amdgpu-radeon1/jammy,now 1:2.4.120.60202-2041575.22.04 amd64 [installed,auto-removable]
libdrm-amdgpu1/jammy-updates,now 2.4.113-2~ubuntu0.22.04.1 amd64 [installed,automatic]
libdrm2-amdgpu/jammy,now 1:2.4.120.60202-2041575.22.04 amd64 [installed,automatic]
libllvm18.1-amdgpu/jammy,now 1:18.1.60202-2041575.22.04 amd64 [installed,auto-removable]
libpostproc55/jammy-updates,jammy-security,now 7:4.4.2-0ubuntu0.22.04.1 amd64 [installed,auto-removable]
libproc-processtable-perl/jammy,now 0.634-1build1 amd64 [installed,automatic]
libprocps8/jammy-updates,jammy-security,now 2:3.3.17-6ubuntu2.1 amd64 [installed,automatic]
libva-amdgpu-drm2/jammy,now 2.16.0.60202-2041575.22.04 amd64 [installed,auto-removable]
libva-amdgpu-wayland2/jammy,now 2.16.0.60202-2041575.22.04 amd64 [installed,auto-removable]
libva-amdgpu-x11-2/jammy,now 2.16.0.60202-2041575.22.04 amd64 [installed,auto-removable]
libva2-amdgpu/jammy,now 2.16.0.60202-2041575.22.04 amd64 [installed,auto-removable]
libwayland-amdgpu-client0/jammy,now 1.22.0.60202-2041575.22.04 amd64 [installed,auto-removable]
mesa-amdgpu-va-drivers/jammy,now 1:24.2.0.60202-2041575.22.04 amd64 [installed,auto-removable]
miopen-hip-dev/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,auto-removable]
miopen-hip/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,auto-removable]
procps/jammy-updates,jammy-security,now 2:3.3.17-6ubuntu2.1 amd64 [installed,automatic]
python3-ptyprocess/jammy,now 0.7.0-3 all [installed,automatic]
rocalution-dev/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,auto-removable]
rocalution/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,auto-removable]
rocblas-dev/jammy,now 4.2.1.60202-116~22.04 amd64 [installed,automatic]
rocblas/jammy,now 4.2.1.60202-116~22.04 amd64 [installed,automatic]
rocdecode-dev/jammy,now 0.6.0.60202-116~22.04 amd64 [installed,auto-removable]
rocdecode/jammy,now 0.6.0.60202-116~22.04 amd64 [installed,auto-removable]
rocfft-dev/jammy,now 1.0.29.60202-116~22.04 amd64 [installed,auto-removable]
rocfft/jammy,now 1.0.29.60202-116~22.04 amd64 [installed,auto-removable]
rocm-cmake/jammy,now 0.13.0.60202-116~22.04 amd64 [installed,auto-removable]
rocm-core/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,automatic]
rocm-dbgapi/jammy,now 0.76.0.60202-116~22.04 amd64 [installed,auto-removable]
rocm-debug-agent/jammy,now 2.0.3.60202-116~22.04 amd64 [installed,auto-removable]
rocm-developer-tools/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-device-libs/jammy,now 1.0.0.60202-116~22.04 amd64 [installed,auto-removable]
rocm-gdb/jammy,now 14.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-hip-runtime-dev/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-hip-runtime/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-language-runtime/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-llvm/jammy,now 18.0.0.24355.60202-116~22.04 amd64 [installed,auto-removable]
rocm-opencl-dev/jammy,now 2.0.0.60202-116~22.04 amd64 [installed,auto-removable]
rocm-opencl-icd-loader/jammy,now 1.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-opencl-runtime/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-opencl-sdk/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-opencl/jammy,now 2.0.0.60202-116~22.04 amd64 [installed,auto-removable]
rocm-openmp-sdk/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-smi-lib/jammy,now 7.3.0.60202-116~22.04 amd64 [installed,automatic]
rocm-utils/jammy,now 6.2.2.60202-116~22.04 amd64 [installed,auto-removable]
rocm-validation-suite/jammy,now 1.0.60202.60202-116~22.04 amd64 [installed]
rocminfo/jammy,now 1.0.0.60202-116~22.04 amd64 [installed,automatic]
rocprim-dev/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,auto-removable]
rocprofiler-dev/jammy,now 2.0.60202.60202-116~22.04 amd64 [installed,auto-removable]
rocprofiler-plugins/jammy,now 2.0.60202.60202-116~22.04 amd64 [installed,auto-removable]
rocprofiler-register/jammy,now 0.4.0.60202-116~22.04 amd64 [installed,automatic]
rocprofiler-sdk-roctx/jammy,now 0.4.0-116~22.04 amd64 [installed,auto-removable]
rocprofiler-sdk/jammy,now 0.4.0-116~22.04 amd64 [installed,auto-removable]
rocprofiler/jammy,now 2.0.60202.60202-116~22.04 amd64 [installed,auto-removable]
rocrand-dev/jammy,now 3.1.0.60202-116~22.04 amd64 [installed,auto-removable]
rocrand/jammy,now 3.1.0.60202-116~22.04 amd64 [installed,auto-removable]
rocsolver-dev/jammy,now 3.26.0.60202-116~22.04 amd64 [installed,automatic]
rocsolver/jammy,now 3.26.0.60202-116~22.04 amd64 [installed,automatic]
rocsparse-dev/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,automatic]
rocsparse/jammy,now 3.2.0.60202-116~22.04 amd64 [installed,automatic]
rocthrust-dev/jammy,now 3.1.0.60202-116~22.04 amd64 [installed,auto-removable]
roctracer-dev/jammy,now 4.1.60202.60202-116~22.04 amd64 [installed,auto-removable]
roctracer/jammy,now 4.1.60202.60202-116~22.04 amd64 [installed,auto-removable]
rocwmma-dev/jammy,now 1.5.0.60202-116~22.04 amd64 [installed,auto-removable]
whiptail/jammy,now 0.52.21-5ubuntu2 amd64 [installed,automatic]
vLLM environment
$ python collect_env.py

Collecting environment information...
WARNING 10-15 06:26:04 rocm.py:13] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
PyTorch version: 2.6.0.dev20240918+rocm6.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.2.41133-dd7f95766

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35

Python version: 3.9.20 (main, Oct  3 2024, 07:27:41)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.2.41133
MIOpen runtime version: 3.2.0
Is XNNPACK available: True

...

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pytorch-triton-rocm==3.1.0+5fe38ffd73
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0.dev20240918+rocm6.2
[pip3] torchvision==0.20.0.dev20240918+rocm6.2
[pip3] transformers==4.45.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch-triton-rocm       3.1.0+5fe38ffd73          pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.6.0.dev20240918+rocm6.2          pypi_0    pypi
[conda] torchvision               0.20.0.dev20240918+rocm6.2          pypi_0    pypi
[conda] transformers              4.45.1                   pypi_0    pypi
ROCM Version: 6.2.41134-65d174c3e
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev109+gcb3b2b9b
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

...

Acknowledgement

We would like to thank Hot Aisle Inc. for sponsoring the MI300X hardware.

