thomas.apestaart.org

meltdown analysis

Filed under: General — Thomas @ 17:33

2009-03-02

In this post I want to describe an approach to doing meltdown analysis. The concept is possibly better known under a name like 'post-mortem'; I don't know, because I never bothered to dig in and read up, but it sounds close. I just wanted to write down my take on the approach before it gets polluted by outside information, and right after we needed to do one at work.

When do I do a meltdown analysis ? When we've gotten into a bad situation involving multiple people that we want to learn from and avoid next time.

Why do we do a meltdown analysis ? While a meltdown is sad, you should consider it a big opportunity to get real-world failure cases that you should be able to deal with. You've already had all of the negative impact; try and get as much positive value out of this bad event as possible to balance out your work karma.

What do I do in the analysis ?

- Establish a timeline based on facts
- Agree on the timeline with all involved parties
- Analyze each item on the timeline and brainstorm on all problems in this item
- Review and create a set of rules/guidelines/process changes that should avoid each problem
- Create practical TODOs from this list
- Review later on to make sure the decisions have been properly applied and incorporated

So, let's dig in a little deeper.

What

Not every problem is a meltdown and warrants a thorough analysis. Indicators for considering a problem a meltdown include:

- multiple mistakes were made, and not making a subset of them would have avoided the entire problem
- actions taken to correct the problem made the problem worse
- multiple people were involved
- bad communication before and during the operation
- multiple customers were affected
- responsibilities are unclear

When

The first steps of the analysis should be done ASAP. The timeline should be established with memories of the problem as fresh as possible. Try and clear two to three hours out of people's schedules the day after to get this done. Insist that they adjust as necessary and that everyone sacrifices time equally for this to happen. This phase is critical for all the other phases.

How

Timeline
Have a meeting ASAP after the incident. Memory of the events should be fresh. Telephone call logs should not be overwritten yet. (Don't laugh, one timeline session was hindered immensely by the fact that one person's Blackberry no longer showed the times of phone calls from around the time of the problem.) System logs of machines should still have all info.
Tell people not to prepare for the meeting; the idea is to collect as many raw facts as possible without any kind of bias. Once they start preparing they start thinking and looking up things and filtering their memories or even outright changing them to avoid blame.
People should bring in tools like laptops and so on that can assist in the detective work of establishing the timeline. (For example, it helps immensely to look up times when our monitoring system detected certain problems.) However, don't let people wander on their laptops.
Establishing a timeline is a breadth-first approach - it is important to get the whole timeline done ASAP before investigating each item.
During this phase any discussion about responsibility, blame, or solutions is to be avoided completely. This phase is a collaborative effort where everyone works together to identify the actions taken by everyone in the group.
Make sure to separate verifiable facts from hunches/recollections/vague timing. You will notice that people change their minds on when things happened, and are much better at remembering the relative order of events than actual times.
It helps to have a 'clockmaster' column for each item - usually clocks used to time events aren't synchronized. My mobile phone doesn't necessarily show the same time as the main server's log file. Separating this allows you to later on adjust to a common time after figuring out how much mobiles, watches and/or systems are out of sync with each other.
Some of the items on this list will need further verification or lookup after the meeting; make sure to allow for this.

In short, this phase should be as short as possible and should establish a complete timeline of verified actions, as transparently as possible.
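To make the 'clockmaster' idea a bit more concrete, here is a minimal sketch in Python of how timeline items could be adjusted to a common time. This is not a tool we actually use; the class, the clock names, the offsets and the times are all made up for illustration, and it assumes you have already worked out how far each clock runs ahead of the reference clock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEntry:
    description: str
    clockmaster: str        # which clock the time was read from
    local_time: datetime    # the time as shown by that clock
    verified: bool = False  # True once backed by a log or other record


def normalize(entries, offsets):
    """Return (common_time, entry) pairs sorted by common time.

    offsets maps a clockmaster name to how far that clock runs ahead of
    the reference clock; subtracting it gives the time on the reference.
    """
    adjusted = [(e.local_time - offsets[e.clockmaster], e) for e in entries]
    return sorted(adjusted, key=lambda pair: pair[0])


# Made-up data: the mobile is assumed to run 4 minutes ahead of the server.
offsets = {"server-log": timedelta(0), "mobile": timedelta(minutes=4)}
entries = [
    TimelineEntry("monitoring raises first alarm", "server-log",
                  datetime(2009, 1, 1, 10, 5), verified=True),
    TimelineEntry("X rolls back the deployment", "server-log",
                  datetime(2009, 1, 1, 10, 10), verified=True),
    TimelineEntry("phone call to X about the alarms", "mobile",
                  datetime(2009, 1, 1, 10, 12)),
]

timeline = normalize(entries, offsets)
for common_time, entry in timeline:
    flag = "verified" if entry.verified else "recollection"
    print(common_time.strftime("%H:%M"), entry.description, f"({flag})")
```

Going by the raw times, the phone call looks like it came after the rollback; once the mobile's offset is subtracted it lands before it. Keeping the clockmaster and the verified flag on each item means you can redo this adjustment later without going back to people's memories.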

Problem analysis
This phase is more of a brainstorm and discussion phase.
The goal is to go over each item in the timeline, and take the time to brainstorm on all the problems all participants see with this particular item. Here you go depth-first, looking at each item and discussing it to everyone's satisfaction until no new problems are seen.

For example, in our particular case, one item on the timeline reads: "X rolls back the deployment."

Based on that item, the following problems were detected:

- a rollback was done without verifying if the alarms in the monitoring system actually affected the service (in our actual case, most of the alarms after deployment were triggered because the actual check was broken because of the deployment)
- a rollback was done without checking if the services affected were services for actual customers (we also have internal services)
- a deployment that affects the monitoring system should always be separate from other deployments (same reason as before)
- rollback was (wrongly) performed using rpm -e which stopped services

Break
Usually it's a good idea to stop the meeting after this part. You're probably past the two hour mark by now, and you have a bunch of facts you still need to look up or verify, and people might want to think about more problems related to the timeline on their own.

Send people a summary of the timeline and problems up to now, then plan a follow-up meeting.

Solutions

After the first phase, get together and propose solutions and guidelines that would have helped avoid each of the problems.

It is important to consider actions for *all* of the problems; often someone says 'if we just would have not done the first thing, none of this would have happened'. Resist the urge to stop there; the goal of this exercise is to learn from all of the mistakes that have been made, to plug as many holes in your processes as possible, and to create processes that are resistant to single failures.

In this phase, the idea is to consolidate all of the different process changes and solutions into a coherent set. You try to minimize the number of changes to your process in such a way that you cover the whole set of detected problems in a way you all agree on.
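One way to keep yourself honest about covering *all* of the problems is to write down which problems each proposed solution addresses and then check what is left over. A rough sketch in Python follows; the problems are the ones from the rollback example, but the ids and the two solutions are hypothetical, invented only to show the check.

```python
# Problems taken from the rollback example; the ids are made up.
problems = {
    "P1": "rollback done without verifying the alarms affected the service",
    "P2": "rollback done without checking if customer services were affected",
    "P3": "monitoring deployment bundled with other deployments",
    "P4": "rollback performed with rpm -e, which stopped services",
}

# Hypothetical solutions, each listing the problems it is meant to cover.
solutions = {
    "S1": {"text": "checklist to run before any rollback",
           "covers": {"P1", "P2"}},
    "S2": {"text": "deploy monitoring changes separately",
           "covers": {"P3"}},
}


def uncovered(problems, solutions):
    """Return the ids of problems that no proposed solution addresses."""
    covered = set()
    for solution in solutions.values():
        covered |= solution["covers"]
    return sorted(set(problems) - covered)


missing = uncovered(problems, solutions)
print(missing)
```

With the made-up solutions above, P4 is still open, so the meeting is not done yet. The same mapping also shows when one solution covers several problems, which helps in keeping the number of process changes small.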

TODO

After establishing a set of solutions and changes to avoid all the problems, the meeting should change from introspective to operative. Based on the list of solutions, a TODO list should be assembled so that it is clear for each participant which points should be resolved by him or her. TODO items could be anything from reviewing and updating a wiki or some documentation to taking actual corrective steps to setting up meetings to define new processes.

This part should be simple and productive.

Review

The worst that can happen is that a meltdown you have already suffered repeats itself. That is, unless you are actually fine with the meltdown scenario happening - in which case it wasn't a 'meltdown' candidate to begin with.

To this end, it is important that you follow up on the practical TODO items you created.
If any of them did not get done, make sure you understand why. If the person changed their mind on the use of the TODO item, figure out if they're right or not. If they're right, you should re-think the solution that was being implemented in that TODO and the problem it was trying to solve.

Usually people forget to implement everything for the millions of reasons people forget to do or postpone things every day. Make sure you keep the ball rolling towards a point where you can consider the meltdown completely handled.

Those who cannot remember the past are condemned to repeat it.

These meltdown documents should also serve as an excellent introduction for newcomers in the company to the kind of serious problems you've run into, how they have been handled, and why certain decisions were made.

Do you do something similar in your job ? Feel free to share your opinion. Meanwhile I'll go look over the previous meltdown notes and verify if we actually implemented everything we said we'd do.

platform-4

Filed under: General — Thomas @ 12:01

2009-02-04

For the longest time our internal platform-4 release seemed like a classic code death march gone wrong. I took some comfort recently reading Dreaming in Code, the story of Chandler, and realized that we could have done worse. (That's actually a better book than I expected, but I'll save that for another post).

Anyway, platform-4 has been in production for a while now, but it is hard to create a definite milestone of 'now we did it' on a system that takes a while to deploy fully.

People involved in planning are optimists at heart - making a plan and trying to stick to it just isn't in a pessimist's nature because the pessimist thinks planning is a lost cause to start with. (This is just a personal opinion - obviously schedules fail for a lot of reasons in the real world - but I do believe this to be a fundamental piece of the planning and delay problem.)

platform-4 became the lightning rod for all the frustration the whole company was having. When things go bad people look for something obvious to blame, preferably something that can be named with a simple single word ('Bush' anyone ?) It's just basic human psychology. platform-4 became that thing for us.

So after deploying, I wanted to take back the word platform-4 and turn it into something good for everyone in a simple way. And what better way in a tech company to celebrate something than make a nice t-shirt ?

All the shirts have the same front (spot the 4), but a personalized back with a phrase relating their work to platform-4. Just like platform-4, the actual making of the shirts became a death march on its own. I started with some phrases, asked Arek and Aitor for help, and we spent a few weeks passing spreadsheets back and forth to decide on namings. It took a while to decide on the shirt type, the color, the front design, the fonts, and everything. Eduardo was patiently incorporating our feedback while trying to keep it on the down-low from his manager. In the end it probably took three months to do something as simple as get a few t-shirts made.

Almost all the shirts are in the picture, so I'd say people are happy with the results. Maybe you'll spot some of them at FOSDEM this year, because our whole development team is going, as well as a big chunk of our support team.

Please take note of Xavier, our Product manager, second from the top left, who for some reason was excited to get on this blog. Sorry ladies, the man's already spoken for.

Big thanks to Arek and Aitor for coming up with phrases, Edu for the great design, and Jean-Noel for providing the funds!

NYC part 3

Filed under: General — Thomas @ 01:18

2008-12-10

day 7

Woke up, left, took the Bolt bus back to NYC. Went to SoHo, had pizza at Ben's pizza, supposedly the best pizza in New York. Went to the Apple store, as usual it looked nice but low on content.

Went for cheesecake at Eileen's, Kristien bought some SoHo boots. Went to pick up our bags at the previous hotel to move to our new hosted bed and breakfast.

Went out for dinner at the Fork and Knife, 6 course sampler menu. Afterwards, tried to get into the Greenhouse, but it was invite only. Instead, ended up at the Bitter End on Houston (which isn't pronounced as you think), which had a pretty good band called Jason Yudoff and the New Hotness. Awesome guitar player, and a horns section.

day 8

Woke up late, went out to Central Park. Convinced the bike rental guy that, no, subzero is not a problem for us to rent a bike and go biking around the park, and yes, we understood that it would be 9$ even if we brought the bike back after 10 minutes.

Went for a burger at the Burger Joint in Park Meridien hotel, awesome burger, worth the 30 minute queue. Tried to go to FAO Schwarz, but there was a queue of over 300 people. Went back to the apartment to meet Ingrid.

Went to Rockefeller Center and up to the observatory deck, an awesome night view of NYC. Went to the East Village, passed by Caracas Arepas on the way for a yummy Reina Pepiada and a beef arepa.

Passed by St Marks comics - a comics store that is open until 1 AM ! Planned to come back there after the show.

Walked down to the Bowery Ballroom, we were late but so was the first opening band. Second opener, Delta Spirit, was awesome - best unknown opening band I've seen in 15 years, since Kitchens of Distinction. They blend an edgy Gomez and Walkmen with a distinct southern flavor. Nada Surf was excellent, which was less of a surprise. They played to a home town and played quite a few old songs, including their first single.

Afterwards all bands were very approachable, walking around and talking to people at the bar. Matthew Caws looks just like his mother, funny to see. Of course the show ended close to 1, so no more going back for comics.

day 9

Got up, got a lovely New York breakfast from our host, Ingrid. Went to the MOMA where we were on the guest list thanks to Lucille, our host at last week's milonga. While smaller than I expected, it was still an impressive museum, and our staff guest tickets got us into the special Van Gogh exhibit easily. Most of all impressed with the design exhibition.

After MOMA, went to B&H to buy the Cowon A3 player I had set my mind to, while Kristien went boots shopping. She arrived in New York with two pairs and left the city with five.

Went to Grand Central Terminal to pick up a cool t-shirt that should have been there by now but still wasn't, ate down at the food court, then took our last metro ride back to Ingrid's. Had some trouble convincing a cab driver to take us to the airport (apparently due to shift changes), stressed about the traffic, but arrived well in time for our flight.

I've taken a few plane rides, but a night time take off from New York is a truly awesome sight to behold once you've walked through its streets.

The personal in-flight entertainment makes for a romantic experience when we count down together pushing 'play' on the same show we're going to watch.

Goodbye New York. Hope to see you back soon.

RSS feed readers for task tracking

Filed under: General — Thomas @ 18:39

2008-11-10

A lot of the tasks I want to follow up on live outside of my GTD system and in various bug trackers. For a while, I've relied on mails to keep track of when I should be looking at tickets and taking some action, but that ended up not scaling very well.

So at some point I started experimenting with using RSS feed readers specifically for this purpose. My first attempt didn't work out very well.

Today, I spent some time trying again with various feed readers.

Here's the basic workflow I want to implement:
- go to my feed reader and see my 'daily' section which contains feeds for specific bug queries that I want to have a daily feedback cycle on
- See what's in there
- Go work on the tickets directly in the browser (so move away from the feed reader)
- After doing the work, go back, and refresh the feed reader to get validation that I've cleared my daily queue

Basically, I only want to use the feed reader to know when I should go do stuff somewhere else.

Now, all feed readers I tried today that actually worked seem to *never* clean up items from the RSS feed that were once in the feed but now have been removed. Is this standard, expected behaviour for a feed reader ? Am I trying to do something that's just not intended to be possible ?

Here are the feed readers I tried:
- tt-rss: doing an update does not delete the tickets that got removed from the rss feed; mailed the author
- liferea: same; though of all apps I tried today it did have the nicest user experience.
- blam: didn't show an error dialog for my https feed that had an invalid certificate; there was a console error though, and there was an option to allow invalid SSL. After that, the feed still didn't show up, and I had some GTK assertion on the console.
- snownews: was unable to open my https rss feeds with authentication; I was rooting for this one since it was command line so it would integrate easily with my gtd workflow
- gfeed: same as liferea
- evolution-rss: I was actually impressed with how well this integrated with Evolution; but afterwards, same problem as with liferea - updating the feed did not remove now-closed bugs that were gone from the rss feed.

Suggestions welcome from anyone who's tried to implement a task tracking workflow this way !

Things that I specifically want:
- store my list of feeds either on the web or in a text file I can commit to svn and sync across 3 machines
- link through to both the actual query and the individual tickets
- allow an 'update now' that removes items that are no longer in the feed
- allow https and invalid certificates
- implement username/password (for my private tracs)
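To be clear about the 'update now' behaviour I'm after: the stored items should simply mirror whatever is in the feed right now. Below is a minimal sketch in Python of that sync step, using only the standard library. It is not how any of the readers above work; the function names and the ticket URLs are made up, and it assumes a plain RSS 2.0 feed where each item has a link.

```python
import xml.etree.ElementTree as ET


def feed_links(rss_text):
    """Return the set of item links in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return {item.findtext("link") for item in root.iter("item")}


def sync(stored, rss_text):
    """Make the stored set mirror the feed; return (added, removed)."""
    current = feed_links(rss_text)
    added = current - stored
    removed = stored - current
    stored -= removed  # drop items that are no longer in the feed
    stored |= added    # pick up items that are new
    return added, removed


# Made-up feeds: ticket 101 gets closed and drops out of the query.
FEED_BEFORE = """<rss version="2.0"><channel><title>daily</title>
<item><title>#101</title><link>https://trac.example.org/ticket/101</link></item>
<item><title>#102</title><link>https://trac.example.org/ticket/102</link></item>
</channel></rss>"""

FEED_AFTER = """<rss version="2.0"><channel><title>daily</title>
<item><title>#102</title><link>https://trac.example.org/ticket/102</link></item>
</channel></rss>"""

stored = set()
sync(stored, FEED_BEFORE)
added, removed = sync(stored, FEED_AFTER)
print(sorted(stored))
print(sorted(removed))
```

After the second sync only ticket 102 is left in the stored set, which is exactly the validation that the daily queue is getting cleared. Fetching the feed over https with authentication is left out here on purpose.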

street play

Filed under: General — Thomas @ 13:22

2008-11-02

Last week, after close to 5 years of living in Barcelona, I finally managed to cross an item off my todo list. I've been wanting to go out and play basketball with some friends since forever. Kristien keeps telling me I should do some sports, and I keep not wanting to go for runs or swims.

So, after a false start the week before, 4 of us finally went out with a basketball in our hands looking for a court. The first court, on Parallel, was a public one, but apparently it is pretty much taken all week by real teams practising. So we ended up going to the Raval, and found a small court with only one light working, surrounded by construction site fences. One of the fences was partly cut open to get on the court again. Two people by the side of the court who looked suspiciously like drug dealers ended up moving a block down when they noticed we were serious about wanting to play. So we were just left with some other suspicious guys eyeing our bags by the side of the court.

We ended up playing to 21, and Aitor and I lost to Arek and Jan, 20 to 21. We were all sweating like pigs afterwards, but we did play for a good 20 minutes. I find it hard to believe that 15 years ago I had no problems playing a full match, but maybe we just took more breaks ?

Anyway, it was fun, and I do hope we can keep it up weekly, possibly on a better court. Jan's proposed to do it over lunch break, take a shower afterwards, and work later to make up for it. Good thinking, because arriving at the restaurant afterwards with a sweaty backside did not make a good impression.

In other news, just got tickets for the New York Knicks playing the Portland Trail Blazers in Madison Square Garden!
