The ZDNet site was down for several hours during Apple’s recent iPad event in San Francisco. Their VP of technical operations decided to reach out to the site’s users and wrote a post-mortem on their blog explaining what happened: “Post mortem: Our site fail Wednesday and what went wrong.”
You have to give him credit for reaching out to the site’s users and trying to explain things on a non-technical level. One thing that immediately caught my attention, though, was the size of this post-mortem and its content…
The goal of these post-mortems, or incident reports, or RFOs (Reason For Outage), as many folks like to call them, is to briefly outline the following things:
- Timeline of the outage
- Root cause of failure
- Short description of what happened
- Actions and corresponding timestamps taken to resolve the issue
- Specific action items to work on to prevent this from happening again
John has obviously tried to cover some of these points, but reading his post-mortem was like reading a book to me. Trust me: no one on an operations team, including his “overwhelmed team,” has time to read this book. A post-mortem must be precise and to the point. I don’t need to know, for example, where your team members were when this happened. All I want to know is the time it took your team to respond. When describing the actions that were taken to resolve the problem, keep them short and include a timestamp for each one.
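To make the “short and timestamped” point concrete, here is a minimal sketch in Python of how an incident timeline could be recorded. The class names, fields, and sample data are all made up for illustration (none of it comes from the ZDNet incident), and treating “time to respond” as the gap between the start of the outage and the first recorded action is my own assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Action:
    at: datetime   # when the action was taken
    summary: str   # one short line, no narrative


@dataclass
class IncidentReport:
    started_at: datetime   # when the outage began
    root_cause: str
    actions: List[Action] = field(default_factory=list)

    def time_to_respond(self) -> timedelta:
        """Time from outage start to the first recorded action (assumed definition)."""
        first = min(a.at for a in self.actions)
        return first - self.started_at

    def timeline(self) -> List[str]:
        """Actions in chronological order, one timestamped line each."""
        ordered = sorted(self.actions, key=lambda a: a.at)
        return [f"{a.at:%H:%M} {a.summary}" for a in ordered]


# Hypothetical sample data for illustration only.
report = IncidentReport(
    started_at=datetime(2020, 1, 1, 10, 0),
    root_cause="Load balancer health checks timed out under load",
    actions=[
        Action(datetime(2020, 1, 1, 10, 25), "Restarted load balancer"),
        Action(datetime(2020, 1, 1, 10, 12), "On-call engineer acknowledged alert"),
    ],
)

print("Time to respond:", report.time_to_respond())
for line in report.timeline():
    print(line)
```

One line per action plus a single response-time figure is about all a reader in operations needs from this part of the report.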
When outlining the action items your team is planning to take to prevent this from happening again, be precise and list things that are measurable and that, if carried out, will address the problem.
“Planning to review the load-balancer configuration” is not really an action item, but “review and increase the keepalive check time-out interval,” for example, is. Likewise, “review how to prevent problems” is not an action item that, if acted upon, will fix the problem. And guess what: if all you plan to do is “review” things, then that’s exactly what will be done and nothing more.