On June 25, 2024, at the IEEE Conference on Artificial Intelligence (CAI), the Embedded LLM team, together with AMD, hosted a hands-on workshop titled “Gen AI on Any Device: Accelerate LLMs with Triton, DirectML, and Declarative RAG.”
The 3.5-hour session combined practical demonstrations, hands-on coding, and real-world case studies, giving participants a rare chance to explore how LLMs can be accelerated and deployed across a wide range of devices — from powerful data center GPUs to everyday Windows laptops. With an engaged audience of AI practitioners, researchers, and engineers, the workshop delivered both technical depth and applied insight, making it one of the standout sessions of the conference.
🔑 Key Highlights
Triton Mastery: GPU Kernels for Any Device
Participants learned how to write optimized GPU kernels using Triton, exploring topics such as block-based computation, memory access patterns, and advanced optimizations. Hands-on exercises included copying tensors, converting images to grayscale, naive vs. optimized matrix multiplication, and performance benchmarking. The key takeaway was that Triton lets developers write high-performance kernels that are portable across GPU platforms while retaining more control than standard frameworks offer.
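To give a flavor of the block-based style covered in the exercises, here is a minimal sketch of an element-wise tensor copy kernel in Triton. The block size and tensor shape are illustrative choices rather than values from the workshop.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    values = tl.load(src_ptr + offsets, mask=mask)
    tl.store(dst_ptr + offsets, values, mask=mask)

def triton_copy(src: torch.Tensor) -> torch.Tensor:
    dst = torch.empty_like(src)
    n_elements = src.numel()
    # Launch a 1D grid with enough blocks to cover every element.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    copy_kernel[grid](src, dst, n_elements, BLOCK_SIZE=1024)
    return dst

x = torch.randn(4096, device="cuda")
assert torch.equal(triton_copy(x), x)
```

The same pattern of program IDs, block offsets, and masked loads and stores scales up to the grayscale and matrix-multiplication exercises; only the indexing arithmetic changes.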
Running LLMs Anywhere with DirectML
The workshop featured a deep dive into quantization techniques (such as int4 AWQ) and the deployment of models like Llama-3, Phi-3, and Mistral on DirectML. A key insight was that DirectML enables efficient LLM execution on Windows laptops and mobile devices, outperforming traditional approaches like llama.cpp in certain workloads. Developers can now bring cutting-edge LLM capabilities to consumer hardware, broadening the accessibility of AI.
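As a rough illustration (not the exact workshop code), the snippet below runs a single greedy decoding step for a quantized model through ONNX Runtime's DirectML execution provider. The model file name and the tensor names "input_ids" and "logits" are assumptions that depend on how the model was exported; a real LLM export typically also requires attention-mask and KV-cache inputs.

```python
import numpy as np
import onnxruntime as ort

# DmlExecutionProvider routes execution through DirectML, which targets any
# DirectX 12 capable GPU on Windows; the CPU provider is kept as a fallback.
session = ort.InferenceSession(
    "llama3-int4-awq.onnx",  # hypothetical path to an int4 AWQ export
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)

input_ids = np.array([[1, 15043, 3186]], dtype=np.int64)  # toy token ids
(logits,) = session.run(["logits"], {"input_ids": input_ids})
next_token = int(np.argmax(logits[0, -1]))  # greedy pick of the next token
print("next token id:", next_token)
```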
Declarative RAG and LLM Chaining with JamAI Base
The session covered the declarative paradigm for Retrieval-Augmented Generation and LLM chaining, focusing on defining “what” instead of “how.” During the hands-on portion, participants built RAG pipelines and chained LLM tasks using JamAI Base’s embedded relational and vector databases (SQLite + LanceDB). Real-world case studies demonstrated how declarative workflows can simplify complex AI applications. The takeaway was that JamAI Base makes it easier for developers to prototype, deploy, and manage advanced GenAI workflows with minimal overhead.
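To make the “what, not how” idea concrete, here is a hypothetical sketch of the declarative style: a table schema declares what each column should contain, and the platform works out how to fill it via retrieval and chained LLM calls. The class and field names below are illustrative and are not the actual jamaibase SDK surface.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    # A prompt template marks a column as LLM-generated; referencing other
    # columns by name creates an implicit chain of LLM calls.
    prompt: str | None = None
    rag_source: str | None = None  # knowledge table to retrieve context from

@dataclass
class ActionTable:
    name: str
    columns: list[Column] = field(default_factory=list)

# Declare a RAG pipeline: "context" is filled by vector retrieval from a
# document store, and "answer" is generated from the question plus context.
qa_table = ActionTable(
    name="support-qa",
    columns=[
        Column("question"),
        Column("context", rag_source="product-docs"),
        Column("answer", prompt="Answer ${question} using ${context}."),
    ],
)
```

Nothing in this schema says when to embed documents, which index to query, or in what order to invoke the model; in a declarative system those execution details are the platform's responsibility.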
👥 Presenters
The workshop was led by the Embedded LLM team:
- Tim Jian Tan
- Dr. Jia Huei Tan
- Dr. Ye Hor Cheong
- Dr. Pin Siang Tan
Together, they blended research insights with engineering best practices to deliver a workshop that was highly technical, practical, and community-driven.
✨ Final Thoughts
This workshop highlighted how GenAI is no longer confined to the data center — it can now be run, accelerated, and optimized on virtually any device. From mastering Triton kernels to running LLMs on Windows laptops and simplifying complex workflows with JamAI Base, participants left IEEE CAI 2024 equipped to push the boundaries of AI deployment.
The success of this workshop reflects a growing interest in accessible, efficient AI infrastructure — and we’re excited to see what participants will build next.

