A Framework for Building Micro Metrics for LLM System Evaluation

LLM accuracy is a challenging topic to address and is far more multidimensional than a simple accuracy score. In this talk we’ll dive deeper into how to measure LLM-related metrics, going through examples, case studies, and techniques beyond a single accuracy score. We’ll discuss how to create, track, and revise micro LLM metrics to get granular direction for improving LLM systems.

Interview:

What is the focus of your work?

I'm mainly focused on applied research and helping teams build in the LLM and conversational AI space. The goal is to look at industry challenges and create accessible, practical research and guides that help us build better conversational experiences.

What’s the motivation for your talk?

Each problem in the AI space, or any use case, has unique challenges. There has been a lot of focus on catch-all metrics, but once you've been serving production traffic you'll find edge cases and scenarios you want to measure. This is where micro metrics help: defining the specific outputs and behaviors you want to track for your use case.
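To make the idea concrete, here is a minimal sketch of what micro metrics can look like in code. The metric names, checks, and data shapes are illustrative assumptions, not material from the talk: each micro metric is a narrow pass/fail check on a single interaction, and tracking is simply aggregating pass rates over production traffic.

```python
# Illustrative sketch only: the metric names, checks, and data shapes below
# are assumptions for demonstration, not material from the talk.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Interaction:
    user_message: str
    model_response: str

# A micro metric is a narrow, use-case-specific check on one interaction.
def within_length_budget(ix: Interaction, max_words: int = 80) -> bool:
    """Responses in a voice channel should stay short enough to read aloud."""
    return len(ix.model_response.split()) <= max_words

def avoids_unsupported_claims(ix: Interaction) -> bool:
    """Flag responses that promise actions the assistant cannot actually take."""
    banned_phrases = ("i have refunded", "i have cancelled your order")
    return not any(p in ix.model_response.lower() for p in banned_phrases)

def track(metrics: dict[str, Callable[[Interaction], bool]],
          traffic: list[Interaction]) -> dict[str, float]:
    """Score each micro metric as a pass rate over a sample of traffic."""
    return {
        name: sum(check(ix) for ix in traffic) / len(traffic)
        for name, check in metrics.items()
    }

if __name__ == "__main__":
    traffic = [
        Interaction("Where is my order?",
                    "It shipped yesterday and should arrive Friday."),
        Interaction("Cancel my order",
                    "I have cancelled your order and refunded you."),
    ]
    print(track({"length": within_length_budget,
                 "no_unsupported_claims": avoids_unsupported_claims}, traffic))
```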

Who is your talk for?

A range from intermediate to senior developers, as well as product leads. The concepts are pretty standard from a product, ML, and software perspective - the learning comes from thinking through the provided case studies and how they can be applied to your own use cases.

What do you think is the next big disruption in software?

Figuring out how to prompt and steer multimodal models. Speech-to-speech is very exciting, but businesses need the ability to check for hallucinations and integrate with other services before responding to users.


Speaker

Denys Linkov

Head of ML @Voiceflow, LinkedIn Learning Instructor, ML Advisor and Instructor, Previously @LinkedIn

Denys leads Enterprise AI at Voiceflow, is an ML Startup Advisor and LinkedIn Learning Course Instructor. He's worked with 50+ enterprises on their conversational AI journeys, and his Gen AI courses have helped 150,000+ learners build key skills. He's worked across the AI product stack, building key ML systems hands-on, managing product delivery teams, and working directly with customers on best practices.


Date

Tuesday Nov 19 / 05:05PM PST (50 minutes)

Location

Ballroom BC

Topics

Machine Learning, LLMs, Evaluations, Case Study, Lessons Learned


From the same track

Session AI/ML

Search: from Linear to Multiverse

Tuesday Nov 19 / 02:45PM PST

The future of search is undergoing a revolutionary transformation, shifting from traditional linear queries to a rich multiverse of possibilities powered by AI.


Faye Zhang

Staff Software Engineer @Pinterest, Tech Lead on GenAI Search Traffic Projects, Speaker, Expert in AI/ML with a Strong Background in Large Distributed Systems

Session LLMOps

Navigating LLM Deployment: Tips, Tricks, and Techniques

Tuesday Nov 19 / 01:35PM PST

Self-hosted Language Models are going to power the next generation of applications in critical industries like financial services, healthcare, and defense.


Meryem Arik

Co-Founder @TitanML, Recognized as a Technology Leader in Forbes 30 Under 30, Recovering Physicist

Session Generative AI

GenAI for Productivity

Tuesday Nov 19 / 11:45AM PST

At Wealthsimple, we leverage Generative AI internally to improve operational efficiency and streamline monotonous tasks. Our GenAI stack is a blend of tools we developed in house and third party solutions.


Mandy Gu

Senior Software Development Manager @Wealthsimple

Session AI/ML

10 Reasons Your Multi-Agent Workflows Fail and What You Can Do About It

Tuesday Nov 19 / 03:55PM PST

Multi-agent systems – a setup where multiple agents (generative AI models with access to tools) collaborate to solve complex tasks – are an emerging paradigm for building applications.


Victor Dibia

Principal Research Software Engineer @Microsoft Research, Core Contributor to AutoGen, Author of "Multi-Agent Systems with AutoGen" book. Previously @Cloudera, @IBMResearch

Session

Scaling Large Language Model Serving Infrastructure at Meta

Tuesday Nov 19 / 10:35AM PST

Running LLMs requires significant computational power, which scales with model size and context length. We will discuss strategies for fitting models to various hardware configurations and share techniques for optimizing inference latency and throughput at Meta.


Ye (Charlotte) Qi

Senior Staff Engineer @Meta