The Time it Wasn't DNS

Abstract

In January 2023, the Microsoft Azure Wide Area Network experienced a global outage. If you were a Microsoft customer at the time, you were impacted by this outage. Sean walks us through the numerous factors (events and conditions) that contributed to the outage and explains why it's not always DNS. (Spoiler: sometimes it's BGP.)

Speaker

Sean Klein

Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure

Sean Klein has been involved with post-incident activities for the better part of two decades. He currently leads the Production Livesite Review program for Microsoft Azure, implementing modern incident analysis methodologies to learn more effectively from Azure's most impactful incidents and outages. Before Microsoft, Sean worked at Salesforce and in private consulting. He is a proud member of the Resilience in Software Foundation.

From the same track

Session Incidents

When Incidents Refuse to End

Wednesday Nov 19 / 11:45AM PST

As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda

Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis

Session Staff Plus Engineering

The Ironies of A^2 I^2

Wednesday Nov 19 / 01:35PM PST

In this talk, we'll explore some of the "ironies" of automation—and now, artificial intelligence—in their interactions with software operators (i.e. you), especially during high consequence, high tempo situations (aka incidents).

J. Paul Reed

Staff Incident Operations Manager @Chime

Session Incident Response

Week-Long Outage: Lifelong Lessons

Wednesday Nov 19 / 02:45PM PST

Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades.

Molly Struve

Staff Site Reliability Engineer @Netflix

Session Incidents

The Human Toll of Incidents & Ways To Mitigate It

Wednesday Nov 19 / 10:35AM PST

Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and the human response to the incidents.

Kyle Lexmond

Production Engineer @Meta, Previously @AWS and @Twitter
