thomas.apestaart.org

Filed under: General — Thomas @ 12:20

2003-12-31
12:20

I finally got back a replacement SCSI drive for the external RAID system we have. Can you believe that ? Two full months. I've had the RAID shut off for most of it. Honestly, what's the point in having a RAID if you can't replace bad disks fast enough ?

In any case, as soon as I got it back, I stuck it in the free slot and turned on the machine. Then the beeping started again. I shut off the alarm in the card's BIOS, then tried to rebuild the disk(1) array. It started doing so, then stopped at 0% and went off beeping again. Sigh.

What's a responsible sysadmin to do in a case like that ? Not tell his boss and spend the rest of his day challenging fate and defying odds, that's what. You know, sometimes computers really are more a case of black art and intuition than anything even closely resembling engineering practices or science.

As luck would have it, I happened to have an external enclosure with *FOUR* disk slots, and I use only three of them. So I still had the top slot free. Being in a really illogical state, ready to try anything, I decided to stick the drive in the top slot instead of the bottom one.

Hm, it didn't fit. Strange. Try an empty disk holder, that works. Try with the drive again, still doesn't work. Ok, try to take a drive out of slot 2 and stick that in slot 1. That works. Hm, strange. Take out drive 1 from slot 2 again, insert new drive in slot 1, doesn't work. Spend about twenty minutes physically comparing both the disk drive and the disk holders. They look identical. Mount new drive in holder 1 from slot 2 and try in slot 1. Doesn't work. Drive 1 in holder 3 from slot 4 works though. This isn't making sense at all. Especially since I can really feel that with the new drive, it slides in all the way except for the last bit and the latch at the front of the disk array that needs to go in a clamp-style thingy can't reach it.

Re-swap drives and holders, jiggle the SCSI connectors a bit since they look rather wobbly (don't try this at home), and try again. Now it slides in and it works. Anyone care to explain ? ;)

So, back to BIOS, turn off beeping again, ask for a rebuild, and after five very long minutes the 0% counter reaches 1%. Now I just have to hope that by the time it is rebuilt, the array will have been smart enough to do everything RAID is supposed to do ... and THEN I can tell my boss what happened, and assure him that everything's fixed. Phew ...

(1) Can anyone tell me why we use "disk" for hard drives and floppy disks but "disc" for CD-ROMs, DVDs, CD-RWs and frisbees ?

Comments (2)

Filed under: General — Thomas @ 12:19

12:19

Prologue : A meeting room at Adaptec, two years ago. Engineers are discussing the design of the 2100S RAID controller...

Engineer 1 : wouldn't it be great if we put a beeper on the card to warn people when one of the disks fails ?
Engineer 2 : yeah, great idea. We can charge more for the card then, and people will be really happy when it fails because they'll be able to tell immediately !

Happy engineers leave the meeting room, congratulating each other on a job well-to-be-done.

So I arrive at work this morning. I am there for all of, oh, five minutes, and a really annoying beeping sound makes itself heard. I look around trying to figure out which of the ten PCs in my immediate vicinity is making the sound. It seems to come from one of the server machines. Yep, it's the big server with the external storage unit (which, if you've read my other entries, you might know is connected through an internal SCSI cable to the inside of the server machine).

What seems to be the matter ? One of the LEDs on the storage unit is flashing. I start looking through the manuals for the storage unit, but they don't help me any further. Meanwhile, the beeping is hurting my ears, but I don't want to switch off the machine until I see there's no other option.

I quickly give up and turn off the machine anyway, as a result of peer pressure. Good thing it only started beeping after I arrived. I don't dare to think what people would have done at six o'clock to get rid of the sound - or, worse yet, at the start of my week-long holiday the day after tomorrow.

Hm, ok, so judging from the BIOS, one of the three disks has failed. No matter, it's a RAID. I can back up the drive and ask for a replacement, right ? Let's see, when did I buy this unit ?

Hm, too bad. Exactly one year and two weeks ago. The drives have a warranty of one year. That sucks ...

Ok, they're also pretty expensive. Hm, how am I going to bring this up with my boss ?

Meanwhile, I still need to back up. The Adaptec site mentions how to turn off the alarm in about six articles, so I'm guessing they've had complaints about that bad design decision (TM) in the past. Only, the site mentions a command-line utility which I have, but it doesn't work. I run raidutil -h first but that doesn't do much. It should print help info, no ?

I check the man page. raidutil -h creates a hot spare. Oops. Luckily I didn't supply arguments, who knows what might have happened ;( The man page itself is messy and plain wrong. It mentions a -a argument and contains the word "alarm" there twice, but all of the arguments mention other actions. From what I gather from the six Adaptec articles, the alarm should be set using -A, and there is also a -a option for other stuff, so the man page seems to mix both up together ;( Talk about bad quality control.

Anyway, none of the options seem to work. Probably having upgraded the machine from RH62 to RH72 has something to do with it, since the Adaptec tools are probably geared towards the previous kernel. Maybe the utilities can't speak dpt-ese at the moment.

So I reboot with the Adaptec bootable CD. The card beeps again, of course. It's a custom Red Hat boot CD. It starts by autoprobing my video card and then starts X. Surprise surprise, X is botched up.

By now I'm pretty pissed and I decide I'll get this fixed no matter what. I'll spare you the details, but poking around the CD-ROM allowed me to get X configured properly in 800x600. So I start X and Adaptec's software starts up in a really ugly window manager I seem to vaguely remember from the dark ages.

The software tells me I have a failed disk. Yes, but why ? Luckily I can turn off the alarm.

So what now ? Make backups ? It's 60 GB. Probably not a good idea to do that over the network, but let's try, just for fun. Ok, so there is a module for my network card, but no network tools like ping, and it doesn't seem like the network card wants to work with the driver. So this isn't going to work out either.

Shut down, attach a spare IDE drive to copy stuff to, reboot, repeat process, open terminal. Hm, what device is the RAID system now ? It used to be /dev/sdb. But that's my main SCSI disk now, and the CD-ROM is /dev/sda for some reason. And I cannot find the RAID controller anywhere else. And there's no dmesg or /var/log/messages output. It's probably been turned off in this custom kernel.

By now I feel I could really make good use of a shotgun. Luckily, I don't have one.

*SIGH* Check drivers and software on the site, download RPMs for older Red Hat versions, restart the machine, try to install them (during the beeping of course), debug the messages from the console to get a clue of what is going wrong, and FINALLY connect to the Adaptec storage manager and be able to turn the bloody beeping OFF. By this time I've already spent quite a few hours trying to do this the right way and my mood has gone to an all-time low this year because of it. I really need that holiday !

So right now I'm copying all of the data off the drive (well, almost all - I have 50 GB of free space and 60 GB of data to copy and I'm deciding who to piss off by taking risks with their data) and I'm composing a mail to the hardware vendor trying to persuade them to take back the drive under the warranty.

On to better news : I've experimented with GTK+ this weekend. I wanted to code a better panel application for Dave/Dina, to be used with the Infrared Controller. Since I'm a GTK+ newbie, I started out with GTK+ 2. Might as well try to get it right from the start.

I must say it was a pleasant experience. After the pains of messing about with Xlib for another project of mine, Xmsgd, this was fairly easy to code. I have a small demo application that shows how the panel would react. I'll probably put it on-line soon at http://davedina.apestaart.org/ so other people can comment on the UI. It's bare-bones, but it'll be functional until some GTK wizard helps out in development ;)

Meanwhile, I'm getting patches for some of my projects and it is really satisfying to be a part of this invisible open-source network.

Oh, and I forgot to mention that GStreamer released 0.3.1 last week ! We're setting up guidelines for 0.4.0 at the moment. And Wim wrote a new capabilities negotiation system (the thing that makes plugins decide if they want to talk to each other).

And now I need to blow off some steam ...

Comments (2)

Filed under: General — Thomas @ 12:18

12:18

I finally got out a first release of columbus. If you have a laptop or a computer you take with you to various places and you're sick of having to somehow manually change your system configuration, then help me out in improving columbus.

Columbus is simple, a bit hackish, but rather elegant. It has worked flawlessly on my laptop for the past two weeks. I take out my network cable at work, put the laptop to sleep with apm, come home, wake it up and plug the cable back in, and presto : everything works the way it should. Right IP (without DHCP), right hosts file, right hostname, ...

All it needs is a few MAC and IP addresses and some configuration files to actually use.
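
For readers wondering what "a few MAC and IP addresses and some configuration files" could amount to, here is a minimal sketch of the general idea in Python. It is not columbus's actual code. It assumes, purely for illustration, that each location is recognised by the MAC address of some known machine on that network and that each location maps to a small profile. The names `Profile`, `PROFILES` and `pick_profile`, the addresses and the file names are all made up, and how the MAC addresses get collected from the network is left out entirely.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class Profile:
    """Settings to apply at one location (all values are illustrative)."""
    name: str
    ip: str
    hostname: str
    hosts_file: str


# Hypothetical table: MAC address of a known machine -> profile for that place.
PROFILES = {
    "00:11:22:33:44:55": Profile("work", "192.168.1.10", "laptop.work.example", "hosts.work"),
    "66:77:88:99:aa:bb": Profile("home", "10.0.0.10", "laptop.home.example", "hosts.home"),
}


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def pick_profile(seen_macs: Iterable[str],
                 profiles: Mapping[str, Profile] = PROFILES) -> Optional[Profile]:
    """Return the profile of the first recognised MAC address, or None."""
    for mac in seen_macs:
        profile = profiles.get(normalize_mac(mac))
        if profile is not None:
            return profile
    return None
```

Calling `pick_profile(["66-77-88-99-AA-BB"])` would select the "home" profile in this sketch; actually applying it (setting the IP, copying the hosts file, changing the hostname) is a separate step that is not shown here.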

So try it out and let me know if it works for you.

Comments (2)

Filed under: General — Thomas @ 12:17

12:17

You all could have told me I had been hacked, or rather, rooted, last week. What good is a weblog among the stars if the stars don't talk ?

Anyway, I should have known right from the start that the system acting funny would indicate something bigger. Everything has been re-installed now and my boss has seen the need for good intrusion denial and detection. So I installed, and took the time to get to know, some decent intrusion tools. I'd tell you which if my Hacking Linux Exposed book would allow it. But it would open me up to social hacking. Even mentioning the book is bad because then people know I probably installed tools mentioned in that book(*).

So, year++. Let's get GStreamer officially accepted world-wide. At the moment we're deciding on the right canonical name for packages and tarballs after the module split. Things are starting to build nicely again; I have a build script running on a few machines which even builds fresh RPMs daily if possible. We might actually be doing something resembling quality control real soon. But let's decide on the names, it's stalling us.

As for Dave/Dina, this is the year I want to see it happen. I'm going to fix up some stuff then release some form of a distribution and try my best to attract new developers. I've laid the groundwork OK, now it's time to get it rolling.

(*)Except I didn't. Heh.

Comments (2)

Filed under: General — Thomas @ 12:16

12:16

My boss is on holiday. He had set up my schedule so that I could program on one thing for three days. Ah, the cruel hand of fate !

I arrived this morning to find one of our servers with a cryptic "Unable to load interpreter /lib/ld-linux.so.2". And you thought stuff like this only happened on Windows (which I refuse to spell incorrectly, btw).

Now, I've seen this error before, and mostly it happened when too many processes were running at the same time and memory space was exhausted - which shouldn't be happening anyway, but it does sometimes. Mostly when I try out a new monitoring tool which goes into a frenzy because of an NFS or samba mount error. But it has always happened when I was still logged in on a terminal somewhere so it was, with some effort, fixable. But no such luck in this case. Since I've made an effort to log out any root terminals left open on the servers I now find myself stuck.

The server is in heavy continuous use : all of our journalists write their pieces in their Explorers, and news is read every half an hour during the day. So I can't afford much downtime. But a quick reboot might do the trick, right ? I can't hack into my own system, so it's my last option. I will not learn what caused it, but at least it'll get on.

Well, no, actually. On a fresh reboot, I couldn't log in either. This was getting worrisome. I quickly tried booting in single user mode and that worked. I downloaded newer glibc packages (which include this problematic library) and rebooted, but still the same. I was getting nervous.

What's a guy to do in a situation like this ? Well, either you panic and start fixing it like mad, or you look and see how any possible solution can be fitted into the bigger plan. Let's see. First of all, the server still works. Only we cannot log in, which is problematic for the import of playlists, as well as for the newsletter I spent enough time on already lately (about which you can also read in this diary ;) ). So people are sure to complain.

Well, the long-term plan was to upgrade a few servers here anyway. I installed them when I was young and fresh at the job, and I might have dropped the ball here and there. I know for sure I installed a tarball too many and a few alien RPMs as well. People sometimes complain about rpm, but here too, it is a case of poor workmen blaming their tools. I used to blame it as well, but now I know that if you heed the warnings rpm gives you, and if you stick to RPMs on an rpm-based system, you're safe.

The thing is, I'll never be able to shut it down for the hour or so I'll need to upgrade or reinstall it, so how do I fix that ? Well, the answer should be easy. Upgrade one of the other, less critical servers (that dual-processor machine with RH70 and a 2.2 kernel that keeps crashing anyway for example) and let that take over for a day or so. So I started upgrading that one. And being the reckless idiot that I am, I thought it was a good idea to finally find a solution for the opened cases problem.

This problem being the one where both the server case and the external storage case need to be left open because I could only get a connection to work between them using an internal twisted SCSI cable. Yes, you are allowed to laugh. This would actually be a good idea since last year during winter it started snowing inside the building through a hole in the roof and the next morning I arrived to find a small puddle of water on the carpet ten centimeters next to the drive array. The opened drive array.

Anyways, this wasn't such a good idea : I tested various SCSI cable connections, but my terminator still lit up green while my hunch was it was supposed to light up red, judging from the other two terminators on the tape streamers.

I spent three hours working on that - my boss will love that when he returns ;) - and gave up. I did run the internal cable through holes at the back of both the server and the drive case.

So I considered whether I should install fresh or upgrade. Someone convinced me to upgrade, even though I wasn't too happy with the drive layout. So I did, and of course the new kernel didn't boot. Probably because we can't expect Red Hat to compile each and every RAID controller into the kernel. So here I am, booting from another older kernel (in that respect, GRUB is great), downloading the new kernel updates and recompiling a kernel to match my system.

Let's see if this time I at least get my kernel configuration right the first time around. That's what experience should do for a person, right ? *SIGH*

Comments (2)


Present Perfect
