thomas.apestaart.org

job search

Filed under: Fun — Thomas @ 14:41

2009-03-05
14:41

I've never had to actually send a job application to a company. Maybe that's one of the reasons why I like Overqualified so darn much.

It's been silent for a long while, but Joey seems to have found new inspiration recently.
Today's instalment had me in stitches. Re-appropriating a cover letter for literary purposes is one of those genius internet ideas I wish I'd had.

Apparently there's a book now - I should go buy it.


definately

Filed under: General — Thomas @ 15:20

2009-03-03
15:20

I was considering defiantly refusing any admonitions to correct one of my most common spelling mistakes, but when a close friend decided to go past his self-imposed boundaries of passive-~~agressiveness~~aggressiveness to complain, I caved and replaced all ~~occurences~~occurrences of 'definately' with 'definitely'. That is, except for the two in this post.

I will try to do better in the future, promise! And to my friend in particular, I'm sorry I daintified our relationship.


meltdown analysis

Filed under: General — Thomas @ 17:33

2009-03-02
17:33

In this post I want to describe an approach to doing meltdown analysis. The concept is possibly better known under a name like 'post-mortem'; I don't know because I never bothered to dig in and read up, but it sounds close. I just wanted to write down my take on the approach before having it polluted by outside information, and right after we needed to do one at work.

When do I do a meltdown analysis ? When we've gotten into a bad situation involving multiple people that we want to learn from and avoid next time.

Why do we do a meltdown analysis ? While a meltdown is sad, you should consider it a big opportunity to get real-world failure cases that you should be able to deal with. You've already had all of the negative impact; try and get as much positive value out of this bad event as possible to balance out your work karma.

What do I do in the analysis ?

Establish a timeline based on facts
Agree on the timeline with all involved parties
Analyze each item on the timeline and brainstorm on all problems in this item
Review and create a set of rules/guidelines/process changes that should avoid each problem
Create practical TODOs from this list
Review later on to make sure the decisions have been properly applied and incorporated

So, let's dig in a little deeper.

What

Not every problem is a meltdown and warrants a thorough analysis. Indicators for considering a problem a meltdown include:

multiple mistakes were made, and not making a subset of them would have avoided the entire problem
actions taken to correct the problem made the problem worse
multiple people were involved
bad communication before and during the operation
multiple customers were affected
responsibilities are unclear

When

The first steps of the analysis should be done ASAP. The timeline should be established while memories of the problem are as fresh as possible. Try to clear two to three hours out of people's schedules the day after to get this done. Insist that they adjust their schedules as necessary and that everyone sacrifices time equally for this to happen. This phase is critical for all the other phases.

How

Timeline
Have a meeting ASAP after the incident. Memory of the events should be fresh. Telephone call logs should not be overwritten yet. (Don't laugh: one timeline session was hindered immensely by the fact that one person's Blackberry no longer showed the times of phone calls from around the problem.) System logs of machines should still have all info.
Tell people not to prepare for the meeting; the idea is to collect as many raw facts as possible without any kind of bias. Once they start preparing they start thinking and looking up things and filtering their memories or even outright changing them to avoid blame.
People should bring in tools like laptops and so on that can assist in the detective work of establishing the timeline. (For example, it helps immensely to look up the times when our monitoring system detected certain problems.) However, don't let people wander off on their laptops.
Establishing a timeline is a breadth-first approach - it is important to get the whole timeline done ASAP before investigating each item.
During this phase any discussion about responsibility, blame, or solutions is to be avoided completely. This phase is a collaborative effort where everyone works together to identify the actions taken by everyone in the group.
Make sure to separate verifiable facts from hunches/recollections/vague timing. You will notice that people will change their mind on when things happened and are much better at remembering the relative order of events than actual times.
It helps to have a 'clockmaster' column for each item - usually clocks used to time events aren't synchronized. My mobile phone doesn't necessarily show the same time as the main server's log file. Separating this allows you to later on adjust to a common time after figuring out how much mobiles, watches and/or systems are out of sync with each other.
Some of the items on this list will need further verification or lookup after the meeting; make sure to allow for this.

In short, this phase should be as short as possible, establishing a complete timeline of verified actions, as transparent and as complete as possible.
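The 'clockmaster' idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `TimelineItem` fields, the clock names, the 3-minute offset and the sample events are all made up, and a real timeline would more likely live in a spreadsheet or wiki table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineItem:
    reported_time: datetime   # time as read from the source clock
    clockmaster: str          # which clock the time was read from
    description: str
    verified: bool = False    # True for facts backed by a log, False for recollections


def normalize(items, offsets):
    """Return (common_time, item) pairs sorted by common time.

    offsets maps a clockmaster name to how far that clock runs ahead of
    the reference clock; subtracting it yields the common time.
    """
    adjusted = [(item.reported_time - offsets[item.clockmaster], item) for item in items]
    return sorted(adjusted, key=lambda pair: pair[0])


# Hypothetical example data: the phone is assumed to run 3 minutes fast.
offsets = {"server-log": timedelta(0), "mobile": timedelta(minutes=3)}
items = [
    TimelineItem(datetime(2009, 3, 1, 14, 5), "mobile", "X phones Y to announce the deployment"),
    TimelineItem(datetime(2009, 3, 1, 14, 3), "server-log", "monitoring raises first alarm", True),
]
timeline = normalize(items, offsets)
for when, item in timeline:
    marker = "fact" if item.verified else "recollection"
    print(when.strftime("%H:%M"), item.clockmaster, marker, item.description)
```

Keeping the raw time and its clock separate means the offsets can be filled in after the meeting, and the order of events is only settled once they are applied. In this made-up case the phone call ends up before the alarm, not after it.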

Problem analysis
This phase is more of a brainstorm and discussion phase.
The goal is to go over each item in the timeline, and take the time to brainstorm on all the problems all participants see with this particular item. Here you go depth-first, looking at each item and discussing it to everyone's satisfaction until no new problems are seen.

For example, in our particular case, one item on the timeline reads: "X rolls back the deployment."

Based on that item, the following problems were detected:

a rollback was done without verifying if the alarms in the monitoring system actually affected the service (in our actual case, most of the alarms after deployment were triggered because the actual check was broken because of the deployment)
a rollback was done without checking if the services affected were services for actual customers (we also have internal services)
a deployment that affects the monitoring system should always be separate from other deployments (same reason as before)
rollback was (wrongly) performed using rpm -e which stopped services

Break
Usually it's a good idea to stop the meeting after this part. You're probably past the two-hour mark by now, and you have a bunch of facts you still need to look up or verify, and people might want to think about more problems related to the timeline on their own.

Send people a summary of the timeline and problems up to now, then plan a follow-up meeting.

Solutions

After the first phase, get together and propose solutions and guidelines that would have helped avoid each of the problems.

It is important to consider actions for *all* of the problems; often someone says 'if we just would have not done the first thing, none of this would have happened'. Resist the urge to stop there; the goal of this exercise is to learn from all of the mistakes that have been made, to plug as many holes in your processes as possible, and to create processes that are resistant to single failures.

In this phase, the idea is to consolidate all of the different process changes and solutions into a coherent set. You try to minimize the number of changes to your process in such a way that you cover the whole set of detected problems in a way you all agree on.

TODO

After establishing a set of solutions and changes to avoid all the problems, the meeting should change from introspective to operative. Based on the list of solutions, a TODO list should be assembled so that it is clear for each participant which points should be resolved by him or her. TODO items could be anything from reviewing and updating a wiki or some documentation to taking actual corrective steps to setting up meetings to define new processes.

This part should be simple and productive.

Review

The worst that can happen is that a meltdown you have already suffered repeats itself. That is, unless you are actually fine with the meltdown scenario happening - in which case it wasn't a 'meltdown' candidate to begin with.

To this end, it is important that you follow up on the practical TODO items you created.
If any of them did not get done, make sure you understand why. If the person changed their mind on the use of the TODO item, figure out if he's right or not. If he's right, you should re-think the solution that was being implemented in that TODO and the problem it was trying to solve.

Usually people forget to implement everything for the millions of reasons people forget to do or postpone things every day. Make sure you keep the ball rolling towards a point where you can consider the meltdown completely handled.

Those who cannot remember the past are condemned to repeat it.

These meltdown documents should also serve as an excellent introduction to newcomers in the company on what kind of serious problems you've run into, how they have been handled, and why certain decisions were made.

Do you do something similar in your job ? Feel free to share your opinion. Meanwhile I'll go look over the previous meltdown notes and verify if we actually implemented everything we said we'd do.


Getting Things Done

Filed under: Hacking,Life,Work — Thomas @ 10:50

10:50

GTD has saved my professional bacon on a number of occasions. There are lots of reasons why this methodology of handling your tasks has fit me well, and I'm sure they're different for different people. For example, one thing it has definitely helped me with is making good use of those days where I'm lacking energy or creativity, or just generally feel too tired to be productive. On those days, I work through tasks strictly by the book, picking those that I can manage just as well even at lower productivity. In fact, I tend to save those tasks for those days, making sure I don't waste my high-energy moments on them.

I know I could do much better at following the system, but it is definitely something that is already paying off in its current form.

One thing I felt I was lacking though was a way to measure my progress in getting through these tasks, as well as my INBOX. I wanted to add a game element to it that would challenge me to stick to the process. The zero INBOX policy in particular is one that is easy to slip on if you let your guard down.

So, in my little universe, gaming means graphing. After some futzing about with scripts that - sadly - go into Evolution's IMAP cache dirs to count mails in inboxes and folders I specifically keep for GTD stuff (apparently Evolution's Python bindings don't allow you to ask Evolution for mails in your folders), as well as grepping my todo.txt (managed by yagtd), and setting up some RRD files and scripts, I now have halfway-decent graphs:

The first one shows my inbox in my two main mail accounts (work and private). The second image shows how many tasks I have in each 'urgency' level (Urgency and Importance are two concepts from Stephen Covey's book that yagtd incorporates into the GTD stuff). Roughly, for me personally, U:5 is 'today', 4 is 'this week', 3 is 'this month', 2 is 'these 3 months', and 1 is 'this year/this life'. 0 is the next life. (I realize that this may be frowned upon by GTD adepts; feel free to share why if you are doing any frowning).
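For the curious, the task-counting half could look roughly like the sketch below. It is not the actual script: it assumes one task per line in todo.txt with the urgency written as a `U:<digit>` token, and the sample tasks, the `@work`/`@home` contexts and the `I:4` token are invented for illustration.

```python
import re
from collections import Counter

# Assumption: urgency appears as a standalone token such as "U:5".
URGENCY = re.compile(r"\bU:([0-5])\b")


def count_by_urgency(lines):
    """Count tasks per urgency level; lines without a U: token are ignored."""
    counts = Counter({level: 0 for level in range(6)})
    for line in lines:
        match = URGENCY.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Made-up sample lines standing in for the contents of todo.txt.
sample = [
    "Reply to hosting invoice @work U:5 I:4",
    "Clean up RRD scripts @home U:3",
    "Learn to juggle U:0",
    "A task without urgency",
]
counts = count_by_urgency(sample)
for level in range(5, -1, -1):
    print(f"U:{level} {counts[level]}")
```

Each of the six numbers would then be fed into its own data source in the RRD file at a regular interval.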

I had to put the second graph on a logarithmic scale to make sense, otherwise the more pressing tasks (U:5) would hardly show up. It's not ideal; I'd prefer the axis to scale differently somehow, but I don't know yet how I want it.
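To see why the linear scale fails, here is a small sketch that scales task counts to bar heights both ways. The counts are hypothetical, not my real numbers, and `bar_heights` is just an illustrative helper, not anything RRD provides.

```python
import math

# Hypothetical task counts per urgency level, not the author's real numbers.
counts = {5: 4, 4: 20, 3: 80, 2: 250, 1: 700}


def bar_heights(counts, height=100, log=False):
    """Scale counts to bar heights in pixels on a linear or log10 axis."""
    scale = (lambda n: math.log10(n + 1)) if log else (lambda n: n)
    top = max(scale(n) for n in counts.values())
    return {level: round(height * scale(n) / top) for level, n in counts.items()}


linear = bar_heights(counts)
logarithmic = bar_heights(counts, log=True)
print("linear:", linear)
print("log:   ", logarithmic)
```

On the linear axis the handful of U:5 tasks is squashed to about a pixel next to hundreds of U:1 tasks; on the log axis it gets a visible share of the graph.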

Anyway, these scripts give me a nice goal to work for, and some numbers to fight against and help me decide whether I should spend the next weekend slacking or hacking.

On the bad side, this made me realize I have over 1000 identified open tasks!
On the good side, when I told a friend about this, he said 'See, there's the difference between men and women. If my girlfriend would realize she'd have a 1000 unfinished tasks, she'd go berserk.' Sexist ? Surely. True ? Possibly, statistically speaking. Motivating ? Definitely - the fact that I actually have all these things identified allows me to sleep at night (I can't believe I used to try and keep all this stuff in my head), and I'm convinced I'll never lack for things to do.


NetworkManager confusion

Filed under: Hacking — Thomas @ 17:51

2009-02-23
17:51

We'd been getting along for a while, but now I'm just really confused again.

For a while now, my home machine has been booting up with its own IP address set as the default gateway. I have no idea why.

After poking at it for the nth time, I noticed that choosing 'Auto Ethernet' works (gets IP and gateway from DHCP) and 'System eth0' does not.

Here's the message spew for the working one:

Feb 23 17:39:56 ana NetworkManager: (eth0): device state change: 8 -> 3
Feb 23 17:39:56 ana NetworkManager: (eth0): deactivating device (reason: 0).
Feb 23 17:39:56 ana NetworkManager: check_one_route(): (eth0) error -34 returned from rtnl_route_del(): Sucess#012
Feb 23 17:39:56 ana avahi-daemon[3122]: Withdrawing address record for 192.168.1.12 on eth0.
Feb 23 17:39:56 ana avahi-daemon[3122]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.1.12.
Feb 23 17:39:56 ana avahi-daemon[3122]: Interface eth0.IPv4 no longer relevant for mDNS.
Feb 23 17:39:56 ana NetworkManager: Activation (eth0) starting connection 'Auto Ethernet'
Feb 23 17:39:57 ana NetworkManager: (eth0): device state change: 3 -> 4
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 1 of 5 (Device Prepare) scheduled...
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 1 of 5 (Device Prepare) started...
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) scheduled...
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 1 of 5 (Device Prepare) complete.
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) starting...
Feb 23 17:39:57 ana NetworkManager: (eth0): device state change: 4 -> 5
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) successful.
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 3 of 5 (IP Configure Start) scheduled.
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) complete.
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 3 of 5 (IP Configure Start) started...
Feb 23 17:39:57 ana NetworkManager: (eth0): device state change: 5 -> 7
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Beginning DHCP transaction.
Feb 23 17:39:57 ana NetworkManager: dhclient started with pid 5969
Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Stage 3 of 5 (IP Configure Start) complete.
Feb 23 17:39:57 ana dhclient: Internet Systems Consortium DHCP Client 4.0.0
Feb 23 17:39:57 ana dhclient: Copyright 2004-2007 Internet Systems Consortium.
Feb 23 17:39:57 ana dhclient: All rights reserved.
Feb 23 17:39:57 ana dhclient: For info, please visit http://www.isc.org/sw/dhcp/
Feb 23 17:39:57 ana dhclient:
Feb 23 17:39:57 ana NetworkManager: DHCP: device eth0 state changed normal exit -> preinit
Feb 23 17:39:57 ana dhclient: Listening on LPF/eth0/00:1d:7d:04:a2:74
Feb 23 17:39:57 ana dhclient: Sending on LPF/eth0/00:1d:7d:04:a2:74
Feb 23 17:39:57 ana dhclient: Sending on Socket/fallback
Feb 23 17:39:58 ana ntpd[2930]: Deleting interface #12 eth0, 192.168.1.12#123, interface stats: received=0, sent=4, dropped=0, active_time=38 secs
Feb 23 17:40:00 ana dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Feb 23 17:40:00 ana dhclient: DHCPOFFER from 192.168.1.1
Feb 23 17:40:00 ana dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Feb 23 17:40:00 ana dhclient: DHCPACK from 192.168.1.1
Feb 23 17:40:00 ana dhclient: bound to 192.168.1.12 -- renewal in 9409 seconds.
Feb 23 17:40:00 ana NetworkManager: DHCP: device eth0 state changed preinit -> bound
Feb 23 17:40:00 ana NetworkManager: Activation (eth0) Stage 4 of 5 (IP Configure Get) scheduled...
Feb 23 17:40:00 ana NetworkManager: Activation (eth0) Stage 4 of 5 (IP Configure Get) started...
Feb 23 17:40:00 ana NetworkManager: address 192.168.1.12
Feb 23 17:40:00 ana NetworkManager: prefix 24 (255.255.255.0)
Feb 23 17:40:00 ana NetworkManager: gateway 192.168.1.254
Feb 23 17:40:00 ana NetworkManager: nameserver '192.168.1.1'
Feb 23 17:40:00 ana NetworkManager: nameserver '157.193.40.42'
Feb 23 17:40:00 ana NetworkManager: domain name 'amantes'
Feb 23 17:40:00 ana NetworkManager: Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled...
Feb 23 17:40:00 ana NetworkManager: Activation (eth0) Stage 4 of 5 (IP Configure Get) complete.
Feb 23 17:40:00 ana NetworkManager: Activation (eth0) Stage 5 of 5 (IP Configure Commit) started...
Feb 23 17:40:00 ana avahi-daemon[3122]: Joining mDNS multicast group on interface eth0.IPv4 with address 192.168.1.12.
Feb 23 17:40:00 ana avahi-daemon[3122]: New relevant interface eth0.IPv4 for mDNS.
Feb 23 17:40:00 ana avahi-daemon[3122]: Registering new address record for 192.168.1.12 on eth0.IPv4.
Feb 23 17:40:01 ana NetworkManager: (eth0): device state change: 7 -> 8
Feb 23 17:40:01 ana NetworkManager: Policy set 'Auto Ethernet' (eth0) as default for routing and DNS.
Feb 23 17:40:01 ana NetworkManager: Activation (eth0) successful, device activated.
Feb 23 17:40:01 ana NetworkManager: Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.
Feb 23 17:40:01 ana ntpd[2930]: Listening on interface #13 eth0, 192.168.1.12#123 Enabled

And the spew for System eth0, the non-working one:

Feb 23 17:40:11 ana NetworkManager: (eth0): device state change: 8 -> 3
Feb 23 17:40:11 ana NetworkManager: (eth0): deactivating device (reason: 0).
Feb 23 17:40:11 ana NetworkManager: eth0: canceled DHCP transaction, dhcp client pid 5969
Feb 23 17:40:13 ana NetworkManager: check_one_route(): (eth0) error -34 returned from rtnl_route_del(): Sucess#012
Feb 23 17:40:13 ana avahi-daemon[3122]: Withdrawing address record for 192.168.1.12 on eth0.
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) starting connection 'System eth0'
Feb 23 17:40:13 ana NetworkManager: (eth0): device state change: 3 -> 4
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 1 of 5 (Device Prepare) scheduled...
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 1 of 5 (Device Prepare) started...
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) scheduled...
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 1 of 5 (Device Prepare) complete.
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) starting...
Feb 23 17:40:13 ana NetworkManager: (eth0): device state change: 4 -> 5
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) successful.
Feb 23 17:40:13 ana avahi-daemon[3122]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.1.12.
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 3 of 5 (IP Configure Start) scheduled.
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 2 of 5 (Device Configure) complete.
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 3 of 5 (IP Configure Start) started...
Feb 23 17:40:13 ana NetworkManager: (eth0): device state change: 5 -> 7
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 4 of 5 (IP Configure Get) scheduled...
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 3 of 5 (IP Configure Start) complete.
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 4 of 5 (IP Configure Get) started...
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 5 of 5 (IP Configure Commit) scheduled...
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 4 of 5 (IP Configure Get) complete.
Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 5 of 5 (IP Configure Commit) started...
Feb 23 17:40:13 ana avahi-daemon[3122]: Interface eth0.IPv4 no longer relevant for mDNS.
Feb 23 17:40:13 ana avahi-daemon[3122]: Joining mDNS multicast group on interface eth0.IPv4 with address 192.168.1.12.
Feb 23 17:40:13 ana avahi-daemon[3122]: New relevant interface eth0.IPv4 for mDNS.
Feb 23 17:40:13 ana avahi-daemon[3122]: Registering new address record for 192.168.1.12 on eth0.IPv4.
Feb 23 17:40:14 ana NetworkManager: (eth0): device state change: 7 -> 8
Feb 23 17:40:14 ana NetworkManager: Policy set 'System eth0' (eth0) as default for routing and DNS.
Feb 23 17:40:14 ana NetworkManager: Activation (eth0) successful, device activated.
Feb 23 17:40:14 ana NetworkManager: Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.
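Reading the two logs side by side, 'Auto Ethernet' starts a DHCP transaction and logs a gateway, while 'System eth0' moves from stage 3 to stage 4 without logging either. A rough sketch for pulling that out of a pile of syslog lines follows; the `summarize` helper is made up for this post, and the patterns only match the message formats seen in the logs above.

```python
import re

CONNECTION = re.compile(r"starting connection '([^']+)'")
GATEWAY = re.compile(r"NetworkManager:\s+gateway (\S+)")


def summarize(log_lines):
    """Collect connection name, DHCP use and logged gateway from syslog lines."""
    summary = {"connection": None, "dhcp": False, "gateway": None}
    for line in log_lines:
        match = CONNECTION.search(line)
        if match:
            summary["connection"] = match.group(1)
        if "Beginning DHCP transaction" in line:
            summary["dhcp"] = True
        match = GATEWAY.search(line)
        if match:
            summary["gateway"] = match.group(1)
    return summary


# A few lines copied from the two logs above.
working = [
    "Feb 23 17:39:56 ana NetworkManager: Activation (eth0) starting connection 'Auto Ethernet'",
    "Feb 23 17:39:57 ana NetworkManager: Activation (eth0) Beginning DHCP transaction.",
    "Feb 23 17:40:00 ana NetworkManager: gateway 192.168.1.254",
]
broken = [
    "Feb 23 17:40:13 ana NetworkManager: Activation (eth0) starting connection 'System eth0'",
    "Feb 23 17:40:13 ana NetworkManager: Activation (eth0) Stage 4 of 5 (IP Configure Get) started...",
]
print(summarize(working))
print(summarize(broken))
```

A missing gateway line does not prove that no gateway was set, only that NetworkManager did not log one for that activation.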

I had never actually tried to make sense of what Auto Ethernet and System eth0 mean and why you would want to choose one over the other (major usability fail to allow the user to choose between two seemingly meaningless options). It seems that, when I configure connections, I can actually see System eth0 to be configured with that fixed gateway of 192.168.1.12, but everything's grayed out, and according to NetworkManager I 'never' used the profile, which is of course a lie.

But at least I can now go on and browse some text config files for that gateway.
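Browsing can be scripted too. The sketch below walks a directory tree and reports every line that mentions a given string. Where the setting actually lives is an assumption I have not verified, so the demo runs on a throwaway directory with a made-up file name and contents.

```python
import os
import tempfile


def find_in_tree(root, needle):
    """Yield (path, line_number, line) for every text line containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file (permissions and the like): skip it


# Demonstration on a temporary directory; file name and contents are hypothetical.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "ifcfg-example"), "w") as handle:
        handle.write("BOOTPROTO=dhcp\nGATEWAY=192.168.1.12\n")
    hits = [(os.path.basename(p), n, l) for p, n, l in find_in_tree(root, "192.168.1.12")]
print(hits)
```

Pointing `find_in_tree` at a real configuration directory with the bogus gateway address as the needle should narrow down which file to look at, assuming the address is stored there as plain text.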

Anyone know what System eth0 actually is ? According to some posts it looks like a config that NetworkManager automatically generates the first time it runs.



Present Perfect