The Ironies of A^2 I^2

Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconsf.com with any comments or concerns.

The Ironies of A^2 I^2 is a presentation by J. Paul Reed on the complexities and unexpected outcomes of using automation and artificial intelligence (AI) during high-stakes incidents in software operations.

Key Points Discussed:

  • The ironies of automation, drawn from Bainbridge's 1983 paper, highlight how automation can create scenarios where human intervention becomes crucial but is often challenged by the lack of transparency and predictability in automated systems.
  • Artificial Intelligence is examined as a more advanced form of automation, which presents additional challenges due to its reliance on learning algorithms and lack of causal models, making it difficult to handle novel situations.
  • The concept of joint cognitive systems is introduced, emphasizing the importance of human and AI interaction within defined system boundaries during incident responses.

Challenges with Automation and AI:

  • The animacy paradox describes how automated systems can seem to act independently during incidents, complicating human operators' ability to control and understand them.
  • AI's lack of transparency and unpredictable behavior often leads to difficulties in incident response, where operators rely on mental models to diagnose and resolve issues.
  • There is a need for better coordination and communication within teams when AI is involved in incident management to prevent extended incident durations.

Concluding Thoughts:

  • Reed emphasizes that while AI can provide significant benefits, it is imperative to be aware of its limitations and ensure that operators are informed about AI's use during incidents to marshal appropriate resources and responses.
  • The continuing enthusiasm for AI must be balanced with a realistic assessment of its current capabilities and the contexts in which it performs well.

Reed's presentation urges a discussion on the judicious use of AI, understanding its limitations, and maintaining human oversight to ensure effective incident response and resolution.

This is the end of the AI-generated content.


Abstract

In this talk, we'll explore some of the "ironies" of automation, and now artificial intelligence, in their interactions with software operators (i.e., you), especially during high-consequence, high-tempo situations (aka incidents).

We'll also look at considerations when building automation and integrating AI into your systems and workflows, including how we reason about them when they go awry and some food for thought on the role both AI _and automation_ play in your next incident. Also? "Fun" incident stories!


Speaker

J. Paul Reed

Staff Incident Operations Manager @Chime

J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful boutique consulting firm, he now spends his days as a Staff Incident Operations Manager at Chime, focusing on incident response, analysis, and systemic risk identification. He's worked with such organizations as VMware, Mozilla, Symantec, and Netflix.

Date

Wednesday Nov 19 / 01:35PM PST (50 minutes)

Location

Seacliff ABC

Slides

Slides are not available

From the same track

Session Incidents

When Incidents Refuse to End

Wednesday Nov 19 / 11:45AM PST

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session Incident Analysis

The Time it Wasn't DNS

Wednesday Nov 19 / 03:55PM PST

In January of 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage.

Sean Klein

Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure

Session Incident Response

Week-Long Outage: Lifelong Lessons

Wednesday Nov 19 / 02:45PM PST

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades.

Molly Struve

Staff Site Reliability Engineer @Netflix

Session Incidents

The Human Toll of Incidents & Ways To Mitigate It

Wednesday Nov 19 / 10:35AM PST

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and the human response to the incidents.

Kyle Lexmond

Production Engineer @Meta, Previously @AWS and @Twitter