In this post I want to describe an approach to doing meltdown analysis. The concept is possibly better known under a name like ‘post-mortem’; I don’t know, because I never bothered to dig in and read up, but it sounds close. I just wanted to write down my take on the approach before it gets polluted by outside information, and right after we needed to do one at work.
When do I do a meltdown analysis? When we’ve gotten into a bad situation involving multiple people that we want to learn from and avoid next time.
Why do we do a meltdown analysis? While a meltdown is sad, you should consider it a big opportunity to collect real-world failure cases that you should be able to deal with. You’ve already had all of the negative impact; try to get as much positive value out of this bad event as possible to balance out your work karma.
What do I do in the analysis?
- Establish a timeline based on facts
- Agree on the timeline with all involved parties
- Analyze each item on the timeline and brainstorm on all problems in this item
- Review and create a set of rules/guidelines/process changes that should avoid each problem
- Create practical TODOs from this list
- Review later on to make sure the decisions have been properly applied and incorporated
So, let’s dig in a little deeper.
Not every problem is a meltdown and warrants a thorough analysis. Indicators for considering a problem a meltdown include:
- multiple mistakes were made, and not making a subset of them would have avoided the entire problem
- actions taken to correct the problem made the problem worse
- multiple people were involved
- bad communication before and during the operation
- multiple customers were affected
- responsibilities are unclear
The first steps of the analysis should be done ASAP. The timeline should be established while memories of the problem are as fresh as possible. Try to clear two to three hours out of people’s schedules the day after to get this done. Insist that they adjust as necessary and that everyone sacrifices time equally for this to happen. This phase is critical for all the other phases.
Have a meeting ASAP after the incident. Memory of the events should be fresh. Telephone call logs should not be overwritten yet. (Don’t laugh: one timeline session was hindered immensely by the fact that one person’s Blackberry no longer showed the times of phone calls from around the problem.) System logs of machines should still have all info.
Tell people not to prepare for the meeting; the idea is to collect as many raw facts as possible without any kind of bias. Once they start preparing they start thinking and looking up things and filtering their memories or even outright changing them to avoid blame.
People should bring in tools like laptops and so on that can assist in the detective work of establishing the timeline. (For example, it helps immensely to look up the times when our monitoring system detected certain problems.) However, don’t let people wander off on their laptops.
Establishing a timeline is a breadth-first approach – it is important to get the whole timeline done ASAP before investigating each item.
During this phase any discussion about responsibility, blame, or solutions is to be avoided completely. This phase is a collaborative effort where everyone works together to identify the actions taken by everyone in the group.
Make sure to separate verifiable facts from hunches/recollections/vague timing. You will notice that people change their minds about when things happened, and that they are much better at remembering the relative order of events than actual times.
It helps to have a ‘clockmaster’ column for each item – usually clocks used to time events aren’t synchronized. My mobile phone doesn’t necessarily show the same time as the main server’s log file. Separating this allows you to later on adjust to a common time after figuring out how much mobiles, watches and/or systems are out of sync with each other.
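As a rough sketch of how the ‘clockmaster’ column can be used afterwards: each item records which clock its time was read from, and once you have figured out how far each clock is off, you shift everything onto one common clock and re-sort. The field names, clock names, offsets and times below are all made up for illustration; only the idea of recording the clock per item comes from the approach described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineItem:
    observed: datetime      # time as read from the clockmaster
    clockmaster: str        # which clock that time was read from
    description: str
    verified: bool = False  # True for facts backed by logs, False for recollections


# Hypothetical offsets: how far each clock runs AHEAD of the reference clock.
# Here the main server's log is taken as the reference.
CLOCK_OFFSETS = {
    "server-log": timedelta(0),
    "x-mobile": timedelta(minutes=3),      # phone runs 3 minutes fast
    "monitoring": timedelta(seconds=-40),  # monitoring runs 40 seconds slow
}


def to_common_time(item, offsets):
    """Translate an item's observed time onto the reference clock."""
    return item.observed - offsets[item.clockmaster]


def normalized_timeline(items, offsets):
    """Return the items sorted by their time on the reference clock."""
    return sorted(items, key=lambda item: to_common_time(item, offsets))


# Made-up example data, all on one arbitrary day.
items = [
    TimelineItem(datetime(2024, 1, 1, 14, 5, 0), "x-mobile", "X calls Y about the alarms"),
    TimelineItem(datetime(2024, 1, 1, 14, 3, 0), "server-log", "X rolls back the deployment", verified=True),
    TimelineItem(datetime(2024, 1, 1, 14, 0, 20), "monitoring", "first alarm is raised", verified=True),
]

ordered = normalized_timeline(items, CLOCK_OFFSETS)
for item in ordered:
    marker = "fact" if item.verified else "recollection"
    print(to_common_time(item, CLOCK_OFFSETS).time(), item.clockmaster, marker, item.description)
```

With these invented offsets, the phone call ends up before the rollback on the common clock, even though the raw times suggest the opposite order. Keeping the observed time and the clock separate means you never have to overwrite what people actually reported.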
Some of the items on this list will need further verification or lookup after the meeting; make sure to allow for this.
In short, this phase should be as short as possible, establishing a timeline of verified actions that is as transparent and as complete as possible.
This phase is more of a brainstorm and discussion phase.
The goal is to go over each item in the timeline, and take the time to brainstorm on all the problems all participants see with this particular item. Here you go depth-first, looking at each item and discussing it to everyone’s satisfaction until no new problems are seen.
For example, in our particular case, one item on the timeline reads: “X rolls back the deployment.”
Based on that item, the following problems were detected:
- a rollback was done without verifying if the alarms in the monitoring system actually affected the service (in our actual case, most of the alarms after deployment were triggered because the deployment had broken the check itself)
- a rollback was done without checking if the services affected were services for actual customers (we also have internal services)
- a deployment that affects the monitoring system should always be separate from other deployments (same reason as before)
- the rollback was (wrongly) performed using rpm -e, which stopped services
Usually it’s a good idea to stop the meeting after this part. You’re probably past the two hour mark by now, and you have a bunch of facts you still need to look up or verify, and people might want to think about more problems related to the timeline on their own.
Send people a summary of the timeline and problems up to now, then plan a follow-up meeting.
After the first phase, get together and propose solutions and guidelines that would have helped avoid each of the problems.
It is important to consider actions for *all* of the problems; often someone says ‘if we just would have not done the first thing, none of this would have happened’. Resist the urge to stop there; the goal of this exercise is to learn from all of the mistakes that have been made, to plug as many holes in your processes as possible, and to create processes that are resistant to single failures.
In this phase, the idea is to consolidate all of the different process changes and solutions into a coherent set. You try to minimize the number of changes to your process in such a way that you cover the whole set of detected problems in a way you all agree on.
After establishing a set of solutions and changes to avoid all the problems, the meeting should change from introspective to operative. Based on the list of solutions, a TODO list should be assembled so that it is clear for each participant which points should be resolved by him or her. TODO items could be anything from reviewing and updating a wiki or some documentation to taking actual corrective steps to setting up meetings to define new processes.
This part should be simple and productive.
The worst that can happen is that a meltdown you have already suffered repeats itself. That is, unless you are actually fine with the meltdown scenario happening – in which case it wasn’t a ‘meltdown’ candidate to begin with.
To this end, it is important that you follow up on the practical TODO items you created.
If any of them did not get done, make sure you understand why. If the person changed their mind on the use of the TODO item, figure out if they’re right or not. If they are, you should re-think the solution that was being implemented in that TODO and the problem it was trying to solve.
Usually people forget to implement everything for the millions of reasons people forget to do or postpone things every day. Make sure you keep the ball rolling towards a point where you can consider the meltdown completely handled.
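To keep that ball rolling, it can help to keep the TODO list in a form where the open items per person are easy to pull out at the follow-up review. The following is only a minimal sketch: the `Todo` fields, the owners and the tasks are hypothetical, and a wiki page or an issue tracker would do the same job.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Todo:
    owner: str     # the participant who agreed to resolve this point
    task: str      # the practical action to take
    problem: str   # the problem from the timeline that this TODO addresses
    done: bool = False


def open_items_by_owner(todos):
    """Group the TODOs that are not done yet by their owner."""
    pending = defaultdict(list)
    for todo in todos:
        if not todo.done:
            pending[todo.owner].append(todo)
    return dict(pending)


# Made-up example list for the follow-up review.
todos = [
    Todo("X", "document the rollback procedure on the wiki", "rollback stopped services"),
    Todo("Y", "deploy monitoring changes separately", "broken checks raised alarms", done=True),
    Todo("X", "list which services are customer-facing", "rollback without checking affected services"),
]

pending = open_items_by_owner(todos)
for owner, items in pending.items():
    for todo in items:
        print(f"{owner}: {todo.task} (addresses: {todo.problem})")
```

Recording the problem next to each task makes it easier to revisit the underlying problem when someone argues that a TODO is no longer useful.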
Those who cannot remember the past are condemned to repeat it.
These meltdown documents should also serve as an excellent introduction for newcomers in the company to the kind of serious problems you’ve run into, how they have been handled, and why certain decisions were made.
Do you do something similar in your job? Feel free to share your opinion. Meanwhile I’ll go look over the previous meltdown notes and verify if we actually implemented everything we said we’d do.