Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is getting the system back to working as expected across all of its parts. Recovering data by itself only brings us back in time to a view of reality that might not reflect how each part sees the world. This could mean millions of different views of reality in large systems.
This talk covers challenges, patterns, and practices for disaster recovery actions in massively distributed systems. It focuses on two commonly used patterns for restoring the whole system to the same reality:
- Rebuild the world
- Restore & reconcile
We will discuss how these approaches were used in different systems, the challenges and tradeoffs experienced, and why sometimes the answer is "Why not both?" Finally, we’ll explore practices that help improve confidence and recovery time, reducing stress and ensuring things get back to working as fast as possible.
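As a rough illustration of how the two patterns differ, here is a minimal sketch. It is not taken from the talk. It assumes the system has an ordered event log that acts as a source of truth, a point-in-time backup, and at least one dependent component holding its own view of the data. The names `Event`, `rebuild_the_world`, and `restore_and_reconcile` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical change record from an assumed source-of-truth log."""
    seq: int
    key: str
    value: str


def rebuild_the_world(event_log):
    """Discard current state and replay every event in sequence order."""
    state = {}
    for event in sorted(event_log, key=lambda e: e.seq):
        state[event.key] = event.value
    return state


def restore_and_reconcile(backup, dependent_view):
    """Restore a backup, then list keys where a dependent component disagrees.

    Returns the restored state and a dict mapping each conflicting key to
    (restored_value, dependent_value). Resolving a conflict is system
    specific, so this sketch only reports it.
    """
    restored = dict(backup)
    conflicts = {}
    for key in sorted(set(restored) | set(dependent_view)):
        ours, theirs = restored.get(key), dependent_view.get(key)
        if ours != theirs:
            conflicts[key] = (ours, theirs)
    return restored, conflicts


# Assumed scenario: the backup was taken after event 2, but a dependent
# component had already seen event 3 before the disaster.
event_log = [Event(1, "a", "1"), Event(2, "b", "2"), Event(3, "a", "3")]
backup = {"a": "1", "b": "2"}
dependent_view = {"a": "3", "b": "2"}

rebuilt = rebuild_the_world(event_log)
restored, conflicts = restore_and_reconcile(backup, dependent_view)
```

In this toy scenario the rebuild arrives at the latest state because the log is complete, while the restore goes back in time and surfaces the disagreement on key `a` for a later reconciliation step. A real system would also have to decide which side wins for each conflict.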
Interview:
What's the focus of your work these days?
My work centers on improving the reliability of Google's infrastructure-as-a-service offering. A lot of the work is proactively identifying and mitigating areas of risk in the system, but my teams also run incident response and drive the process of learning from incidents.
What's the motivation for your talk at QCon San Francisco 2023?
In all the companies I've worked for, there's been a moment where we needed to recover or repair some critical data, and the architectural decisions made early on made that moment either a lot easier or a lot harder than it needed to be. I wanted to give folks some tools for reasoning about their architecture's ability to respond to disasters so that, for them, it might fall on the easy side.
How would you describe your main persona and target audience for this session?
This talk is for senior engineers who might be making big architectural decisions and systems engineers who might be involved in any disaster-level response.
Is there anything specific that you'd like people to walk away with after watching your session?
I would like folks to walk away with the understanding that disaster recovery is more challenging than just backing up your data stores. If you want your system to be recoverable in a disaster, you have to make sure the architecture will support it.
Speaker
Michelle Brush
Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of "2 out of the 97 Things Every SRE Should Know"
Michelle Brush is a math geek turned computer geek with over 20 years of software development experience. She has developed algorithms and data structures for pathfinding, search, compression, and data mining in embedded as well as distributed systems. In her current role as an Engineering Director, SRE for Google, she leads teams of SREs that ensure GCP's Compute Engine and Persistent Disk products are reliable. Previously, she served as the Director of HealtheIntent Architecture for Cerner Corporation, responsible for the data engineering platform for Cerner’s Population Health solutions. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm. She is the author of 2 out of the 97 Things Every SRE Should Know.