Week-Long Outage: Lifelong Lessons

Abstract

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like those we had seen from past upgrades. But what started as a routine upgrade became a week-long catastrophe that brought our platform to its knees. For six grueling days, we fought cluster instability while our Fortune 500 customers demanded answers we didn't have.

This talk shares the raw story of that struggle and how disaster became the greatest teacher. The experience highlighted how psychological safety, community support, exceptional leadership, and team character can often matter more than technical solutions. You'll take away six hard-won lessons that will better prepare you for when your next "routine" upgrade goes sideways.

Interview:

What is your session about, and why is it important for senior software developers?

This session shares six lifelong lessons from a week-long Elasticsearch outage at my former company in 2017. While the story is entertaining—involving a critical system failure, desperate debugging, and an eventual bug discovery—the real value is in the hard-won lessons that can save folks from similar disasters. The technical lessons (having rollback plans, doing performance testing, and being wary of bias) apply to changes of any size, not just major ones. The human lessons (widening your circle, having strong leadership support, and building resilient teams) are what ultimately determine how well your team can survive a crisis. Technical leaders are uniquely positioned to implement these practices and model the culture shifts needed to handle incidents effectively.

Why is it critical for software leaders to focus on this topic right now, as we head into 2026?

As systems grow more complex and interconnected, incidents are inevitable—it's not a question of "if" but "when." We need leaders who embrace incidents as learning opportunities and create psychological safety for teams to be vulnerable, ask for help early, and grow from failures. The lessons in this talk—especially around leadership support during crises and building teams with strong character—are essential for creating resilient engineering cultures that can thrive amid inevitable disruptions.

What are the common challenges developers and architects face in this area?

The challenges tend to fall into two buckets: technical blind spots and cultural barriers. On the technical side, past success can create assumptions about future changes, and it's easy to overlook the full scope of what needs testing or planning. On the human side, there's often reluctance to ask for help early, particularly among experienced engineers who feel pressure to have all the answers. Teams also struggle with the gap between having plans on paper versus actually practicing them under realistic conditions. These challenges are universal across the industry, which is why sharing stories about them matters.

What's one thing you hope attendees will implement immediately after your talk?

I hope leaders commit to showing up supportively during incidents, leaning into being their team's cheerleader and defender, not their interrogator. When an incident happens, a leader should strive to remove external pressures and trust that their team will figure it out. My favorite saying is “People don't remember what you did, they remember how you made them feel.” The way leaders show up during a crisis shapes the engineering culture. Engineers will watch how they react, and early-career engineers especially need to see that asking for help is a strength, not a weakness. Leaders’ composure and trust during incidents build psychological safety that pays dividends long after the incident is resolved.


Speaker

Molly Struve

Staff Site Reliability Engineer @Netflix

Molly Struve is a Staff Site Reliability Engineer at Netflix with a degree in Aerospace Engineering from MIT. She is passionate about building reliable and scalable software and teams. Her diverse experience includes leading globally distributed teams, architecting databases, and optimizing complex systems and processes. Every day, she strives to lead by example and empower those around her by sharing all that she has learned from her time in the industry. When she isn't wrangling incidents or servers, she can be found riding and jumping her show horses.

Date

Wednesday Nov 19 / 02:45PM PST (50 minutes)

Location

Ballroom BC

Topics

Incident Response, On-Call, Outage, Reliability, SRE

From the same track

Session Incidents

When Incidents Refuse to End

Wednesday Nov 19 / 11:45AM PST

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session Staff Plus Engineering

The Ironies of AAII

Wednesday Nov 19 / 01:35PM PST

Details coming soon.

Paul Reed

Staff Incident Operations Manager @Chime

Session Incident Analysis

The Time it Wasn't DNS

Wednesday Nov 19 / 03:55PM PST

In January of 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage.

Sean Klein

Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure

Session Incidents

The Human Toll of Incidents & Ways To Mitigate It

Wednesday Nov 19 / 10:35AM PST

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and the human response to the incidents.

Kyle Lexmond

Production Engineer @Meta, Previously @AWS and @Twitter