Week-Long Outage: Lifelong Lessons

Abstract

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like those we had seen from past upgrades. But what started as a routine upgrade became a week-long catastrophe that brought our platform to its knees. For six grueling days, we fought cluster instability while our Fortune 500 customers demanded answers we didn't have.

This talk shares the raw story of that struggle and how disaster became the greatest teacher. The experience highlighted how psychological safety, community support, exceptional leadership, and team character can often matter more than technical solutions. You'll take away six hard-won lessons that will better prepare you for when your next "routine" upgrade goes sideways.

Interview:

What is your session about, and why is it important for senior software developers?

This session shares six lifelong lessons from a week-long Elasticsearch outage at my former company in 2017. While the story is entertaining—involving a critical system failure, desperate debugging, and an eventual bug discovery—the real value is in the hard-won lessons that can save folks from similar disasters. The technical lessons (having rollback plans, doing performance testing, and being wary of bias) apply to changes of any size, not just major ones. The human lessons (widening your circle, having strong leadership support, and building resilient teams) are what ultimately determine how well your team can survive a crisis. Technical leaders are uniquely positioned to implement these practices and model the culture shifts needed to handle incidents effectively.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

As systems grow more complex and interconnected, incidents are inevitable—it's not a question of "if" but "when." We need leaders who embrace incidents as learning opportunities and create psychological safety for teams to be vulnerable, ask for help early, and grow from failures. The lessons in this talk—especially around leadership support during crises and building teams with strong character—are essential for creating resilient engineering cultures that can thrive amid inevitable disruptions.

What are the common challenges developers and architects face in this area?

The challenges tend to fall into two buckets: technical blind spots and cultural barriers. On the technical side, past success can create assumptions about future changes, and it's easy to overlook the full scope of what needs testing or planning. On the human side, there's often reluctance to ask for help early, particularly among experienced engineers who feel pressure to have all the answers. Teams also struggle with the gap between having plans on paper versus actually practicing them under realistic conditions. These challenges are universal across the industry, which is why sharing stories about them matters.

What's one thing you hope attendees will implement immediately after your talk?

I hope leaders commit to showing up supportively during incidents, leaning into being their team's cheerleader and defender, not their interrogator. When an incident happens, a leader should strive to remove external pressures and trust that their team will figure it out. My favorite saying is “People don't remember what you did, they remember how you made them feel.” The way leaders show up during a crisis shapes the engineering culture. Engineers will watch how they react, and early-career engineers especially need to see that asking for help is a strength, not a weakness. Leaders’ composure and trust during incidents build psychological safety that pays dividends long after the incident is resolved.


Speaker

Molly Struve

Staff Site Reliability Engineer @Netflix

Molly Struve is a Staff Site Reliability Engineer at Netflix with a degree in Aerospace Engineering from MIT. She is passionate about building reliable and scalable software and teams. Her diverse experience includes leading globally distributed teams, architecting databases, and optimizing complex systems and processes. Every day, she strives to lead by example and empower those around her by sharing all that she has learned from her time in the industry. When she isn't wrangling incidents or servers, she can be found riding and jumping her show horses.

Date

Wednesday Nov 19 / 02:45PM PST (50 minutes)

Location

Ballroom BC

Topics

Incident Response, On-Call, Outage, Reliability, SRE

From the same track

Session Incidents

When Incidents Refuse to End

Wednesday Nov 19 / 11:45AM PST

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session Staff Plus Engineering

The Ironies of AAII

Wednesday Nov 19 / 01:35PM PST

Details coming soon.

Paul Reed

Staff Incident Operations Manager @Chime

Session Incident Analysis

The Time it Wasn't DNS

Wednesday Nov 19 / 03:55PM PST

In January of 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage.

Sean Klein

Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure

Session Incidents

The Human Toll of Incidents & Ways To Mitigate It

Wednesday Nov 19 / 10:35AM PST

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and the human response to the incidents.

Kyle Lexmond

Production Engineer @Meta, Previously @AWS and @Twitter