The generative AI landscape is changing rapidly, with new models appearing on the horizon every few days. However, these models share many common patterns and execution phases in their hardware and software characteristics.
In this talk, we will use Llama2 as a base model to highlight these basic characteristics. We will present a detailed analysis of Llama2 workload performance on a platform powered by the AMD EPYC processor, with all analysis completed on the latest multi-core CPU servers. This includes a scalability analysis as well as a detailed phase-by-phase analysis to detect software and hardware bottlenecks at each stage.
Based on this analysis, we will share our recommendations for tuning, optimization, and deployment best practices for the software stack, taking into account the hardware on which it is deployed. We will then extend the analysis to Llama3 using architecture-relevant software optimizations and share deployment best practices for the most common AI inference use cases.
Speaker
Rema Hariharan
Principal Engineer @AMD, Seasoned Performance Engineer With a Base in Quantitative Sciences and a Penchant for Root-Causing
Dr. Rema Hariharan is an engineer known for her quantitative approach to solving complex engineering challenges. With a foundation in engineering and advanced expertise in operations research, her career spans diverse optimization problems, from inventory control and credit management to in-depth performance analysis of network systems and computer hardware.
Beginning her career at AT&T Bell Labs, Dr. Hariharan has contributed her expertise to leading technology companies, including Sun Microsystems, eBay, and AMD. In her current role at AMD, she focuses on optimizing the performance of AI models on AMD hardware, driving efficiency and innovation in the field.