As AI technologies continue to evolve, demand for processing both structured and unstructured data is growing rapidly across diverse industries. However, scaling AI batch processing to thousands of GPUs poses significant scalability, reliability, and observability challenges. These challenges are amplified further in high-throughput batch data processing with large language models (LLMs), given their computational demands and complexity.
In this presentation, we will demonstrate how we built a scalable and efficient batch inference stack with Ray at Anyscale. We begin by introducing Ray as a robust, scalable AI compute engine, followed by an in-depth look at Ray Data, a versatile, high-performance library for distributed deep learning data processing. Next, we will introduce vLLM, the leading open-source framework for LLM inference, and illustrate how combining Ray Data with vLLM offers an ideal solution for scalable batch inference.
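To make the pattern concrete, here is a minimal sketch of Ray Data driving vLLM for batch inference: each actor replica loads one vLLM engine, and Ray Data streams batches of prompts through the replica pool. The model name, toy dataset, replica count, and output path are illustrative assumptions, not the production configuration presented in the talk.

```python
import numpy as np
import ray
from vllm import LLM, SamplingParams

class VLLMPredictor:
    """Runs one vLLM engine per Ray actor replica."""

    def __init__(self):
        # Model name is an illustrative placeholder.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
        self.sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

    def __call__(self, batch: dict) -> dict:
        # Ray Data passes each batch as a dict of column name -> numpy array.
        outputs = self.llm.generate(list(batch["prompt"]), self.sampling_params)
        batch["response"] = np.array([o.outputs[0].text for o in outputs])
        return batch

# Toy dataset; in practice prompts would be read from Parquet, JSON, etc.
ds = ray.data.from_items(
    [{"prompt": f"Summarize record {i}."} for i in range(1_000)]
)

results = ds.map_batches(
    VLLMPredictor,
    concurrency=4,   # four actor replicas, each with its own vLLM engine
    num_gpus=1,      # one GPU per replica
    batch_size=64,
)
results.write_parquet("/tmp/batch_inference_outputs")
```

Loading the model once in the actor's constructor amortizes engine startup across every batch the replica processes, while `map_batches` handles scheduling and backpressure across the GPU pool.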
Speaker
Cody Yu
Staff Software Engineer and Tech Lead @Anyscale, Ex-Amazonian, vLLM Committer, Apache TVM PMC
Cody Yu is a staff software engineer and tech lead at Anyscale, working on LLM inference performance optimization. He is a community member of several popular open-source projects, including vLLM, SGLang, and Apache TVM. Before Anyscale, Cody was a founding engineer at BosonAI and a senior applied scientist at AWS AI. His recent research focuses on hardware acceleration and performance optimization for LLM systems.