The Stories Behind the Incidents

This track takes you behind the curtain and into the heart of system meltdowns at some of the world's leading software companies. Learn directly from SREs about real-world, high-impact production failures at scale, including the immediate challenges of triage, diagnosis, and mitigation in complex distributed systems. From these stories, you’ll gain insights into the nature of real incidents and how skilled SREs recover from them.

You’ll learn how ambiguous, confusing, and uncertain incidents are when you’re in the middle of them, and hear how engineers improvised innovative solutions to restore service. You’ll also learn how fundamentally unpredictable incidents are and, consequently, why it is important to prepare to be surprised.


From this track

Session

The Incident that Shaped Our Engineering Culture

Wednesday Nov 19 / 10:35AM PST

Details coming soon.

Session

War Stories from the Front Lines of Production

Wednesday Nov 19 / 11:45AM PST

Details coming soon.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session

Rebuilding A System After a Security Breach

Wednesday Nov 19 / 01:35PM PST

Details coming soon.

Session

The Bug That Never Should've Been: A Tale of Code Review, Testing, and Human Error

Wednesday Nov 19 / 02:45PM PST

Details coming soon.

Session

Postmortem of a Downtime: What Was Learned from A Big Mistake

Wednesday Nov 19 / 03:55PM PST

Details coming soon.

Track Host

Lorin Hochstein

Staff Software Engineer @Airbnb, Writes @surfingcomplexity.blog, Previously @Netflix and Member of the Resilience in Software Foundation

Lorin is a Staff Software Engineer in Reliability at Airbnb, where he wrangles and analyzes incidents and works on improving the system’s ability to recover quickly from failure.

Lorin started his career as an academic, obtaining a PhD in computer science and a tenure-track position as an assistant professor at the University of Nebraska–Lincoln. Over time he transitioned into industry, eventually joining the Chaos team at Netflix, where he wrote version 2 of Chaos Monkey and worked on the Chaos Automation Platform. He found organic failures much more interesting, however, and moved into the incident space.

He is an active member of the Resilience in Software Foundation and writes frequently about software, complex systems, and incidents at surfingcomplexity.blog.
