You are viewing content from a past/completed conference.
The Ironies of A^2 I^2
Abstract
In this talk, we'll explore some of the "ironies" of automation, and now artificial intelligence, in their interactions with software operators (i.e., you), especially during high-consequence, high-tempo situations (aka incidents).
We'll also look at considerations when building automation and integrating AI into your systems and workflows, including how we reason about them when they go awry and some food for thought on the role both AI _and automation_ play in your next incident. Also? "Fun" incident stories!
Speaker
J. Paul Reed
Staff Incident Operations Manager @Chime
J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful boutique consulting firm, he now spends his days as a Staff Incident Operations Manager at Chime, focusing on incident response, analysis, and systemic risk identification. He's worked with such organizations as VMware, Mozilla, Symantec, and Netflix.
From the same track
Session
Incidents
When Incidents Refuse to End
Wednesday Nov 19 / 11:45AM PST
As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.
Vanessa Huerta Granda
Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis
Session
Incident Analysis
The Time it Wasn't DNS
Wednesday Nov 19 / 03:55PM PST
In January of 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage.
Sean Klein
Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure
Session
Incident Response
Week-Long Outage: Lifelong Lessons
Wednesday Nov 19 / 02:45PM PST
Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades.
Molly Struve
Staff Site Reliability Engineer @Netflix
Session
Incidents
The Human Toll of Incidents & Ways To Mitigate It
Wednesday Nov 19 / 10:35AM PST
Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and the human response to the incidents.
Kyle Lexmond
Production Engineer @Meta, Previously @AWS and @Twitter