How Netflix Shapes our Fleet for Efficiency and Reliability

Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconsf.com with any comments or concerns.

The presentation "How Netflix Shapes our Fleet for Efficiency and Reliability" is delivered by Joseph Lynch and Argha C, who discuss strategies Netflix uses to optimize service efficiency and reliability globally.

Main Topics Discussed:

  • Capacity Modeling: Netflix models capacity requirements automatically to optimize hardware use based on business criticality, workload characterization, and AWS hardware performance.
  • Optimization Loop: Continuous monitoring and reshaping of fleets ensure that resources meet changing workload patterns cost-effectively while maintaining service availability.
  • Global Deployment: Netflix utilizes a complex, multi-region infrastructure paired with a distributed CDN to manage diverse demand from various devices worldwide, such as mobiles, PCs, and TVs.
  • Efficiency and Failures: The talk weighs resource usage against the cost of failures, citing the example that downtime at Amazon costs $20,000 per second.
  • Workload Characteristics: Workloads are analyzed not only by utilization but also by their impact and recovery speed, aiming to minimize potential risks and losses.
  • Hardware Variety: Understanding the variability and availability of different hardware generations enables optimized utilization.
  • Load Shedding: Strategies to handle unpredictable shifts in demand by shedding non-critical loads to maintain service for high-priority tasks.

Key Insights:

  • Efficiency is about matching hardware resources with workload needs while considering failure costs.
  • Striking that balance means weighing traffic, capacity planning, and server type selection under varied conditions.
  • Global infrastructure supports the need for seamless, reliable streaming across multiple devices.
  • Adaptive management allows quick reactions to traffic changes, achieving significant recovery time reductions during spikes.

The session provides a comprehensive view of the technical strategies and operational frameworks that Netflix employs to maintain its global streaming service efficiently and reliably, adapting continuously to dynamic demand and capacity shifts.

This is the end of the AI-generated content.


Abstract

Netflix runs on a complex multi-layer cloud architecture made up of thousands of services, caches, and databases. As hardware options, workload patterns, cost dynamics, and the Netflix products evolve, the cost-optimal hardware and configuration for running our services are constantly changing. It is no longer sufficient in modern cloud computing to buy large amounts of the same shape of computer and try to pack every workload onto it with large fixed buffers, for both efficiency and availability reasons. It is also no longer sufficient for platform teams to work 1:1 with every service team to optimize their hardware selection; this does not scale.

This talk shows an alternative strategy, where each workload is placed on price-optimal hardware using automated understanding of hardware performance combined with workload characterization. Furthermore, as workload patterns shift, we can continuously re-evaluate and react for every cluster to ensure business outcomes for minimal spend.
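As a rough illustration of what "price-optimal placement" could look like, here is a minimal sketch that picks the cheapest hardware shape able to serve a workload. It is not Netflix's model: the `Shape` and `Workload` types, the shape names, the capacity figures, and the prices are all hypothetical values invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    name: str             # hypothetical label, not a real instance type
    cpu_capacity: float   # assumed work units one instance can sustain
    memory_gib: float     # memory per instance
    hourly_cost: float    # made-up price per instance-hour


@dataclass(frozen=True)
class Workload:
    cpu_demand: float            # peak work units across the whole cluster
    memory_per_node_gib: float   # minimum memory every instance must have


def instances_needed(shape, workload):
    """Return the instance count for this shape, or None if it cannot fit."""
    if shape.memory_gib < workload.memory_per_node_gib:
        return None
    return max(1, math.ceil(workload.cpu_demand / shape.cpu_capacity))


def cheapest_plan(shapes, workload):
    """Return (shape name, instance count, hourly cost) with the lowest cost."""
    best = None
    for shape in shapes:
        count = instances_needed(shape, workload)
        if count is None:
            continue  # shape is too small for this workload
        cost = count * shape.hourly_cost
        if best is None or cost < best[2]:
            best = (shape.name, count, cost)
    return best


# Hypothetical catalog and workload, for illustration only.
shapes = [
    Shape("small", 4.0, 16.0, 0.20),
    Shape("medium", 8.0, 32.0, 0.38),
    Shape("large", 16.0, 64.0, 0.80),
]
workload = Workload(cpu_demand=30.0, memory_per_node_gib=24.0)
plan = cheapest_plan(shapes, workload)
print(plan)
```

In this toy catalog the "small" shape is excluded for lack of memory, and four "medium" instances come out slightly cheaper than two "large" ones. A real model would also have to account for measured hardware performance, availability of each shape, and the buffers discussed in the talk.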

We will start with understanding how we automatically model capacity requirements, including key concepts like service buffer allocation based on business criticality. Then we will show how we marry this understanding of workload needs with a deep understanding of AWS hardware performance and pricing to place each workload on efficient hardware. Finally, we will walk through the continuously running optimization loop, which monitors, detects changes, and re-shapes our fleet to maintain business outcomes as load patterns constantly change.
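The idea of sizing buffers by business criticality and then re-evaluating as load shifts can be sketched as follows. This is an assumption-laden toy, not the system described in the talk: the tier names, the buffer percentages in `BUFFER_PCT_BY_TIER`, the `tolerance` threshold, and both function names are hypothetical.

```python
import math

# Hypothetical headroom per criticality tier, in percent of peak load.
BUFFER_PCT_BY_TIER = {"critical": 50, "standard": 25, "best_effort": 10}


def desired_instances(peak_load, per_instance_capacity, tier):
    """Instances needed to serve peak load plus the tier's buffer."""
    buffered_load = peak_load * (100 + BUFFER_PCT_BY_TIER[tier]) / 100
    return max(1, math.ceil(buffered_load / per_instance_capacity))


def reshape_decision(current, peak_load, per_instance_capacity, tier,
                     tolerance=0.10):
    """One pass of a monitor-detect-reshape loop for a single cluster.

    Returns (action, target instance count). The cluster is left alone
    when the target is within `tolerance` of the current size, so small
    fluctuations in load do not cause constant resizing.
    """
    if current <= 0:
        raise ValueError("current instance count must be positive")
    target = desired_instances(peak_load, per_instance_capacity, tier)
    drift = abs(target - current) / current
    if drift <= tolerance:
        return ("keep", current)
    return ("scale_up" if target > current else "scale_down", target)


# Illustrative inputs: peak of 1000 requests/s, 100 requests/s per instance.
print(reshape_decision(15, 1000, 100, "critical"))
print(reshape_decision(10, 1000, 100, "critical"))
print(reshape_decision(20, 1000, 100, "standard"))
```

The same observed load yields a larger fleet for a "critical" service than for a "standard" one, which is the essence of criticality-based buffering. A continuously running loop would call something like `reshape_decision` for every cluster each time fresh load data arrives.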

Even with all this planning, our systems still face unexpected load shifts that exceed modeled bounds, so to close we will briefly cover how we manage traffic demand and compute supply to ensure we can maintain availability while intelligently and rapidly injecting capacity into the right server groups to keep Netflix up and running and our customers happily streaming.
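One common way to maintain availability during an unexpected spike is priority-based load shedding: as utilization climbs, the least critical traffic is rejected first so that critical requests keep succeeding while capacity is added. The sketch below illustrates that general technique only; the priority levels, the thresholds in `SHED_AT_UTILIZATION`, and the request names are assumptions made for this example, not details from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    priority: int  # assumed scale: 0 = most critical, larger = less critical


# Hypothetical thresholds: shed a priority once utilization reaches this
# level. Priority 0 has no entry, so it is never shed in this sketch.
SHED_AT_UTILIZATION = {2: 0.80, 1: 0.90}


def should_admit(priority, utilization):
    """Admit the request unless its priority is being shed at this load."""
    limit = SHED_AT_UTILIZATION.get(priority)
    return limit is None or utilization < limit


def admitted(requests, utilization):
    """Return the names of the requests that survive shedding."""
    return [r.name for r in requests if should_admit(r.priority, utilization)]


# Made-up request types for illustration.
requests = [
    Request("start_playback", 0),
    Request("browse_catalog", 1),
    Request("prefetch_artwork", 2),
]
print(admitted(requests, 0.50))
print(admitted(requests, 0.85))
print(admitted(requests, 0.95))
```

At 50% utilization everything is admitted; at 85% only the lowest-priority request is dropped; at 95% only the most critical request type remains.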


Speaker

Joseph Lynch

Principal Software Engineer @Netflix Building Highly-Reliable and High-Leverage Infrastructure Across Stateless and Stateful Services

Joseph Lynch is a Principal Software Engineer for Netflix who focuses on building highly-reliable and high-leverage infrastructure across our stateless and stateful services. He led the shift of the Netflix data tier to abstraction, driving resilience through a Data Gateway architecture. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large scale distributed systems over the years, he currently spends much of his time building automated capacity management and resiliency features into the Netflix fleet.


Speaker

Argha C

Staff Software Engineer @Netflix - Leading Netflix's Cloud Scalability Efforts for Live

Argha C is a Staff Software Engineer at Netflix who leads Netflix's Cloud Scalability efforts for Live. Over the years, he has led key business initiatives focused on scaling and availability at the Netflix Edge, including architecting for efficient DDoS mitigation techniques. Recently, he has been redefining Netflix's approach to capacity management and Compute efficiency, while extending learnings from the Edge to ensure fleetwide resilience for Netflix's Cloud infrastructure.


From the same track

Session AI Architecture

Realtime and Batch Processing of GPU Workloads

Wednesday Nov 19 / 01:35PM PST

SS&C Technologies runs 47 trillion dollars of assets on our global private cloud. We have the primitives for infrastructure as well as platforms as a service like Kubernetes, Kafka, NiFi, Databases, etc.

Joseph Stein

Principal Architect of Research & Development @SS&C Technologies, Previous Apache Kafka Committer and PMC Member

Session Architecture

From ms to µs: OSS Valkey Architecture Patterns for Modern AI

Wednesday Nov 19 / 02:45PM PST

As AI applications demand faster and more intelligent data access, traditional caching strategies are hitting performance and reliability limits.

Dumanshu Goyal

Uber Technical Lead @Airbnb Powering $11B Transactions, Formerly @Google and @AWS

Session AI/ML

Producing the World's Cheapest Tokens: A How-to Guide

Wednesday Nov 19 / 10:35AM PST

AI inference is expensive, but it doesn’t have to be. In this talk, we’ll break down how to systematically drive down the cost per token across different types of AI workloads.

Meryem Arik

Co-Founder and CEO @Doubleword (Previously TitanML), Recognized as a Technology Leader in Forbes 30 Under 30, Recovering Physicist

Session Platform Engineering

Write-Ahead Intent Log: A Foundation for Efficient CDC at Scale

Wednesday Nov 19 / 03:55PM PST

As companies grow, so does the complexity of keeping distributed systems in sync. At DoorDash, we tackled this challenge while building a high-throughput, domain-oriented data platform for capturing changes across hundreds of services.

Vinay Chella

Engineering Leader @DoorDash - Specializing in Distributed Systems, Streaming & Storage Platforms, Apache Cassandra Committer, Previously Engineering Leader @Netflix

Akshat Goel

Staff Software Engineer, Core Infra at @DoorDash, Previously Senior Software Engineer @Amazon