The Human Toll of Incidents & Ways To Mitigate It

Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconsf.com with any comments or concerns.

The presentation by Kyle Lexmond examines how significant incidents affect the individuals involved and proposes ways to make incident management more considerate of them.

Key Points Discussed:

  • Definition of an Incident: An incident is an event that negatively impacts business metrics, requires an immediate response, and must be mitigated to restore normal operation. The emphasis is on mitigating the impact rather than solving the entire problem.
  • Human Aspect of Incidents: The talk emphasizes the pressure and emotional impact on individuals involved in incident management. Factors influencing this include personal pride, professional obligations, and company optics.
  • Incident Management Process:
    • Focuses on efficient coordination and clear allocation of responsibility, with clear communication as a key element.
    • Stresses the importance of a dedicated incident manager to streamline operations and keep the response on track.
    • Encourages the use of collaborative tools, such as shared documentation, for real-time updates and future reviews.
  • Mitigation over Solving: Lexmond stresses focusing on mitigation during an active incident, i.e., reducing customer impact promptly rather than pursuing a complete fix.
  • Learning from Incidents: Incidents are not inherently negative; they can be avenues for learning and reprioritizing work. Incident management should be guided by human-centered approaches to alleviate stress and improve responses.

Conclusion: The overarching theme of the talk is that considering human factors in incident management leads to more efficient and compassionate responses to technical setbacks.

This is the end of the AI-generated content.


Abstract

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and the human response to the incidents. Learn about some actions that could help you while you respond to the next outage, as well as changes you can drive to make incident response more considerate of the humans involved.


Speaker

Kyle Lexmond

Production Engineer @Meta, Previously @AWS and @Twitter

Kyle is an almost-SWE who learned about Site Reliability Engineering in passing conversation during university, changing the course of his career. Having worked at big names (Twitter, Amazon, Facebook) and small (CBSA, Kik), he enjoys working on building optimized and efficient systems that break less often after he touches them. He currently lives in Seattle with a partner and an adorable dog. (Yes, he has pictures.)

Date

Wednesday Nov 19 / 10:35AM PST (50 minutes)

Location

Seacliff ABC

Topics

Incidents, Failures, Personal Resiliency

Slides

Slides are not available

From the same track

Session Incidents

When Incidents Refuse to End

Wednesday Nov 19 / 11:45AM PST

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session Staff Plus Engineering

The Ironies of A^2 I^2

Wednesday Nov 19 / 01:35PM PST

In this talk, we'll explore some of the "ironies" of automation, and now artificial intelligence, in their interactions with software operators (i.e., you), especially during high-consequence, high-tempo situations (aka incidents).

J. Paul Reed

Staff Incident Operations Manager @Chime

Session Incident Analysis

The Time it Wasn't DNS

Wednesday Nov 19 / 03:55PM PST

In January of 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage.

Sean Klein

Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure

Session Incident Response

Week-Long Outage: Lifelong Lessons

Wednesday Nov 19 / 02:45PM PST

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades.

Molly Struve

Staff Site Reliability Engineer @Netflix