Summary
Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconsf.com with any comments or concerns.
The presentation "When Incidents Refuse to End" by Vanessa Huerta Granda explores the complexities and impacts of long-running technology incidents.
Key Points Discussed:
- Long-running incidents stretch teams and systems, and the chaos of resolving them reveals the difference between "work as imagined" and "work as done."
- These incidents expose organizational and system fragility, showing where communication needs to improve and priorities need to shift.
- The speaker emphasizes the importance of resilience, not only in technical systems but also in the people managing these incidents. This includes building systems for endurance and improving paging and escalation strategies.
- Three examples illustrate different incident experiences: a data center fire, internal systems complexity, and recurring holiday instability.
- The presentation highlights the need for a socio-technical approach to incident management, which includes understanding stress responses and leadership roles in high-pressure situations.
- Handling incidents involves not only solving immediate technical problems but also managing human factors and learning from the systemic issues exposed during prolonged outages.
The talk advocates for an integrated approach to incident management that considers both the technological and human elements to build stronger and more resilient systems and teams.
This is the end of the AI-generated content.
Abstract
As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out. In this talk, we’ll dive into real examples of incidents that dragged on far longer than anyone expected, and unpack what they revealed about our systems, processes, and mental models.
We’ll explore what these situations taught us about coordination under pressure, shifting system behavior, and the limitations of our current practices for detection and response. We will also look at how a mindset of curiosity helped us make sense of the mess — not just to resolve the immediate situation, but to improve how we adapt, learn, and build stronger systems and teams.
Speaker
Vanessa Huerta Granda
Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis
Vanessa is an Engineering Manager at Enova leading the Resilience Engineering team focusing on their Production Incident process, learning from incidents, and leading the on-call rotation of Incident Commanders. She previously worked as a Solutions Engineer at Jeli helping companies make the most of their incidents. In 2021 she co-authored Howie: The Post-Incident Guide, an in-depth explanation of how tech organizations can learn from incidents.
She has led the Chicago Women in Technology Conference and is an admin of the Learning From Incidents community. She is passionate about continuous improvement, getting teams to talk to each other, and Diversity and Inclusion in Tech.