Slack outage post mortem

11/10/2023

Postmortem Culture: Learning from Failure

By Daniel Rogers, Murali Suriar, Sue Lueder,

Our experience shows that a truly blameless postmortem culture results in more reliable systems, which is why we believe this practice is important to creating and maintaining a successful SRE organization.

Introducing postmortems into an organization is as much a cultural change as it is a technical one. The key takeaway from this chapter is that making this change is possible, and needn’t seem like an insurmountable challenge. Don’t emerge from an incident hoping that your systems will eventually remedy themselves. You can start small by introducing a very basic postmortem procedure, and then reflect and tune your process to best suit your organization. As with many things, there is no one size that fits all.

When written well, acted upon, and widely shared, postmortems can be a very effective tool for driving positive organizational change and preventing repeat outages. To illustrate the principles of good postmortem writing, this chapter presents a case study of an actual outage that happened at Google. An example of a poorly written postmortem highlights the reasons why “bad” postmortem practices are damaging to an organization that’s trying to create a healthy postmortem culture. We then compare the bad postmortem with the actual postmortem that was written after the incident, highlighting the principles and best practices of a high-quality postmortem.

The second part of this chapter shares what we’ve learned about creating incentives for nurturing a robust postmortem culture and how to recognize (and remedy) the early signs that the culture is breaking down.

Finally, we provide tools and templates that you can use to bootstrap a postmortem culture.

For a comprehensive discussion on blameless postmortem philosophy, see Chapter 15 in our first book, Site Reliability Engineering.

This case study features a routine rack decommission that led to an increase in service latency for our users. A bug in our maintenance automation, combined with insufficient rate limits, caused thousands of servers carrying production traffic to simultaneously go offline.

While the majority of Google’s servers are located in our proprietary datacenters, we also have racks of proxy/cache machines in colocation facilities (or “colos”). Racks in colos that contain our proxy machines are called satellites. Because satellites undergo regular maintenance and upgrades, a number of satellite racks are being installed or decommissioned at any point in time. At Google, these maintenance processes are largely automated.

The decommission process overwrites the full content of all drives in the rack using a process we call diskerase. Once a machine is sent to diskerase, the data it once stored is no longer retrievable. The steps for a typical rack decommission are as follows:

# Send all candidate machines matching "filter" to decom
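The case study above attributes the outage to a bug in the maintenance automation combined with insufficient rate limits. As a rough illustration of the rate-limit half of that, here is a minimal Python sketch of a selection step that refuses to hand over more than a capped share of the fleet in one run. Everything in it is hypothetical: the function name `select_for_decom`, the `max_percent` cap, the empty-filter guard, and the inventory format are assumptions made for illustration, not Google's tooling or the actual fix.

```python
from fnmatch import fnmatchcase


class DecomLimitExceeded(Exception):
    """Raised when a filter matches more machines than the safety cap allows."""


def select_for_decom(machines, name_filter, max_percent=10):
    """Return machines whose name matches name_filter, or raise if too many match.

    machines:    list of dicts with a "name" key (hypothetical inventory format)
    name_filter: shell-style pattern such as "rack-a-*"
    max_percent: largest share of the fleet one run may touch (assumed policy)
    """
    if not name_filter:
        # Treat a missing filter as an error instead of "match everything".
        raise ValueError("refusing to run with an empty filter")

    candidates = [m for m in machines if fnmatchcase(m["name"], name_filter)]

    # Integer arithmetic avoids floating-point rounding surprises in the cap.
    limit = len(machines) * max_percent // 100
    if len(candidates) > limit:
        raise DecomLimitExceeded(
            f"{len(candidates)} machines match {name_filter!r}; cap is {limit}"
        )
    return candidates
```

With a fleet of 20 machines and the default cap of 10 percent, a filter that matches two machines passes, while a filter that matches all 20 raises `DecomLimitExceeded` before anything irreversible is started. Failing closed matters here because, as noted above, data sent to diskerase is not retrievable.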
