Beyond Durability: Enhancing Database Resilience and Reducing the Entropy Using Write-Ahead Logging at Netflix

In modern database systems, durability guarantees are crucial but often insufficient in scenarios involving extended system outages or data corruption. Additionally, challenges arise when a single user mutation request is distributed across different database nodes or across multiple databases, necessitating a robust mechanism for mutation replay.

At Netflix, we have developed an innovative architecture that not only addresses these issues but also enhances the system's overall functionality. This talk will outline how we leverage Write-Ahead Logging (WAL) as a service to ensure all database mutations are durable and maintain ordering guarantees, even during periods when a downstream database is temporarily unavailable.
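The replay behavior described above can be illustrated with a minimal sketch. It is not Netflix's implementation: the names `WriteAheadLog`, `FlakyStore`, and `DownstreamUnavailable` are hypothetical, and an in-memory list stands in for what would be durable storage in a real service. It only shows the idea of recording a mutation first and applying pending entries strictly in log order once the downstream store is reachable again.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Mutation = Tuple[str, str]  # (key, value); hypothetical shape for this sketch


class DownstreamUnavailable(Exception):
    """Raised by the hypothetical downstream store when it cannot accept writes."""


@dataclass
class WriteAheadLog:
    # Hypothetical callback that writes one mutation to the downstream store.
    apply: Callable[[Mutation], None]
    # Assumption: a real log would live on durable storage; a list keeps this short.
    entries: List[Mutation] = field(default_factory=list)
    applied_upto: int = 0  # index of the next entry to apply

    def write(self, mutation: Mutation) -> None:
        """Record the mutation first, then try to drain the log in order."""
        self.entries.append(mutation)
        self.replay()

    def replay(self) -> int:
        """Apply pending entries in log order, stopping at the first failure.

        Returns the number of entries still pending.
        """
        while self.applied_upto < len(self.entries):
            try:
                self.apply(self.entries[self.applied_upto])
            except DownstreamUnavailable:
                break  # keep the entry; a later replay retries from this position
            self.applied_upto += 1
        return len(self.entries) - self.applied_upto


class FlakyStore:
    """Toy downstream database that can be switched off to simulate an outage."""

    def __init__(self) -> None:
        self.available = True
        self.applied: List[Mutation] = []

    def put(self, mutation: Mutation) -> None:
        if not self.available:
            raise DownstreamUnavailable()
        self.applied.append(mutation)


def demo() -> List[Mutation]:
    store = FlakyStore()
    wal = WriteAheadLog(apply=store.put)
    wal.write(("a", "1"))
    store.available = False   # outage begins: writes are logged but not applied
    wal.write(("b", "2"))
    wal.write(("a", "3"))
    store.available = True    # outage ends
    wal.replay()              # pending entries are applied in their original order
    return store.applied
```

Stopping at the first failed entry, rather than skipping it, is what preserves ordering in this sketch: no later mutation reaches the store before an earlier one.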

A key design principle in our approach is the separation of concerns, which helps manage system complexity and encourages broader adoption. By employing a pluggable architecture, we enable the development of high-leverage capabilities such as secondary indexes, delayed queues, and generic replication services. These capabilities ensure data consistency across various non-distributed data stores such as RocksDB, Redis, and Memcached.
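One way to picture a pluggable design is a dispatcher that hands each logged mutation to independently registered plugins. The sketch below is purely illustrative and assumes nothing about the actual service: `WalDispatcher`, `SecondaryIndexPlugin`, and `ReplicaPlugin` are hypothetical names, and both plugins keep their state in plain dictionaries.

```python
from typing import Dict, List, Protocol, Set


class WalPlugin(Protocol):
    """Hypothetical plugin interface: each plugin sees every logged mutation."""

    def on_mutation(self, key: str, value: str) -> None: ...


class SecondaryIndexPlugin:
    """Maintains a value -> keys index alongside the primary key -> value data."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}
        self._by_value: Dict[str, Set[str]] = {}

    def on_mutation(self, key: str, value: str) -> None:
        old = self._current.get(key)
        if old is not None:
            self._by_value[old].discard(key)  # drop the stale index entry
        self._current[key] = value
        self._by_value.setdefault(value, set()).add(key)

    def lookup(self, value: str) -> Set[str]:
        return set(self._by_value.get(value, set()))


class ReplicaPlugin:
    """Copies every mutation into a second store (a dict in this sketch)."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def on_mutation(self, key: str, value: str) -> None:
        self.data[key] = value


class WalDispatcher:
    """Fans each mutation out to the registered plugins, in registration order."""

    def __init__(self) -> None:
        self._plugins: List[WalPlugin] = []

    def register(self, plugin: WalPlugin) -> None:
        self._plugins.append(plugin)

    def dispatch(self, key: str, value: str) -> None:
        for plugin in self._plugins:
            plugin.on_mutation(key, value)


index = SecondaryIndexPlugin()
replica = ReplicaPlugin()
dispatcher = WalDispatcher()
dispatcher.register(index)
dispatcher.register(replica)

dispatcher.dispatch("user:1", "US")
dispatcher.dispatch("user:2", "US")
dispatcher.dispatch("user:1", "CA")
```

Because the dispatcher knows only the `on_mutation` interface, a new capability can be added by registering another plugin without touching the existing ones, which is the separation of concerns the abstract refers to.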

This talk explores the key considerations to take into account when building a WAL for various use cases.

Key Takeaways

  • How building systems with a pluggable architecture drives higher adoption and enables high-leverage capabilities.
  • Keeping separation of concerns as a core design principle to manage the complexity of the system.
  • How to invest in building newer systems and architectures as use cases evolve.

Speaker

Prudhviraj Karumanchi

Staff Software Engineer at Data Platform @Netflix, Building Large-Scale Distributed Storage Systems and Cloud Services, Previously @Oracle, @NetApp, and @EMC/Dell

Prudhviraj Karumanchi is a Staff Software Engineer at Data Platform @Netflix, building large-scale distributed storage systems and cloud services. Prudhvi is currently leading the Netflix caching infrastructure. Prior to Netflix, Prudhvi worked at large enterprises such as Oracle, NetApp, and EMC/Dell, building cloud infrastructure and contributing to File, Block, and Object storage systems.

Speaker

Vidhya Arvind

Staff Software Engineer @Netflix Data Platform, Founding Member of Data Abstractions at Netflix, Previously @Box and @Verizon

Vidhya Arvind is a Staff Software Engineer at Netflix, building data abstractions. She is a founding member of Netflix’s data abstraction platform, which supports common patterns including KeyValue, Tree, TimeSeries, Table Metadata, and more. She loves learning, debugging, scaling systems, and solving hard problems. Vidhya currently spends most of her time providing scalable abstractions for thousands of Netflix developers.

From the same track

Session Architecture

OpenSearch Cluster Topologies for Cost-Saving Autoscaling

Tuesday Nov 19 / 11:45AM PST

The indexing rates of many clusters follow some sort of fluctuating pattern, be it day/night, weekday/weekend, or any other duality in which the cluster shifts from being active to less active. In these cases, how does one scale the cluster?

Amitai Stern

Engineering Manager @Logz.io, Managing Observability Data Storage of Petabyte Scale, OpenSearch Leadership Committee Member and Contributor

Session

Stream and Batch Processing Convergence in Apache Flink

Tuesday Nov 19 / 02:45PM PST

The idea of executing streaming and batch jobs with one engine has been around for a while. People always say batch is a special case of streaming. Conceptually, it is.

Becket Qin

Principal Staff Software Engineer @LinkedIn

Session Data Pipelines

Efficient Incremental Processing with Netflix Maestro and Apache Iceberg

Tuesday Nov 19 / 03:55PM PST

Incremental processing, an approach that processes only new or updated data in workflows, substantially reduces compute resource costs and execution time, leading to fewer potential failures and less need for manual intervention.

Jun He

Staff Software Engineer @Netflix, Managing and Automating Large-Scale Data/ML Workflows, Previously @Airbnb and @Hulu

Session

Stream All the Things — Patterns of Effective Data Stream Processing

Tuesday Nov 19 / 01:35PM PST

Data streaming is a really difficult problem. Despite 10+ years of attempting to simplify it, teams building real-time data pipelines can spend up to 80% of their time optimizing it or fixing downstream output by handling bad data at the lake.

Adi Polak

Director, Advocacy and Developer Experience Engineering @Confluent

Session

Unconference: Shift-Left Data Architecture

Tuesday Nov 19 / 05:05PM PST