Present Perfect



Recovering from a lost /var on Fedora/Red Hat/CentOS

Filed under: Fedora,sysadmin — Thomas @ 8:28 pm

Last week, after upgrading my home desktop to F11, palimpsest told me one of its disks was broken. The desktop runs on two 250 GB drives in software RAID. It was time to get new drives.

After a weekend of fiddling with new 1 TB disks for my home desktop, trying failure scenarios, making sure the system can boot from each of the two drives, and waiting for the 4-hour resync of the software RAID in between each step, I finally closed up the desktop machine and cleaned up under my desk again, thinking I was done with my half-yearly messing about with broken disks.

I guess I was tempting fate anyway. While I was doing a routine operation on my home server, after all the configuration stuff I’d done to set up asterisk last week, an rsync suddenly aborted, a journal errored out, a partition got remounted read-only, and the log was full of scary drive errors. Ouch.

Well, that’s why I keep around a big box of old drives – for when some drive fails and I want to tempt fate even more by reusing an old drive that’s probably going to fail real soon too. And anyway, I had just spent my hard drive piggybank on the new desktop drives.

Luckily, I seemed to have a 400 GB SATA drive lying around that used to belong to my media center. I don’t remember why I swapped it out, given that the media center has a 160 GB drive for the OS (and two 1.5 TB RAID drives for the data, of course), but this was a lucky break. I booted with a rescue CD, and tried copying the root filesystem of my CentOS 5.2 home server to this new drive. Which worked fine, except for /var, where I triggered an Input/Output error and some more drive errors in the kernel log.

So, powered off, took out the broken drive, and put it in a USB chassis. The advantage of a USB chassis is that you can easily just replug the drive to try again, instead of locking up your system terribly and having to reboot. Sadly, /var was broken beyond repair. I ran an e2fsck hoping to recover the contents, and that partly worked, but some of the important stuff is missing even from lost+found (apart from the annoying situation where you have to reconstruct file names, which I usually end up not bothering with).

But really, how important can /var be? Turns out, rather important. As in, you need it to boot in the first place. And also, it holds your rpm database. Crap.

Some Googling gave me some posts on how to reconstruct your rpm database from log files (using --justdb --noscripts --notriggers). But to use those, you actually need those log files. Where are those? On /var as well. Crap. And they’re not in lost+found either.
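
For reference, that approach boils down to re-registering each package in the database without touching the files already on disk. A minimal sketch, assuming you still know which packages were installed and have the matching .rpm files collected in one directory. The function name is hypothetical, and it only prints the commands instead of running them:

```shell
# Hypothetical helper, not from any of the posts I found.
# Print (do not run) one rpm command per package file found in the directory $1.
# --justdb updates only the rpm database, leaving the filesystem alone;
# --noscripts and --notriggers skip the package scriptlets and triggers.
justdb_commands() {
    for pkg in "$1"/*.rpm; do
        [ -e "$pkg" ] || continue   # the directory had no .rpm files
        printf 'rpm -i --justdb --noscripts --notriggers %s\n' "$pkg"
    done
}
```

You would review the printed commands first and only then pipe them to sh as root.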

Ok, so time to get creative. Here’s what I ended up doing:

  • Create /var/lib/rpm, and run rpm --rebuilddb to end up with an empty rpm database
  • Based on the contents of a directory such as /usr/bin (the commands below use /etc), figure out what packages ought to be installed:

    rpm -qf /etc/* | grep 'not owned' | cut -f2 -d' ' > /tmp/unowned
    yum --enablerepo=c5-media --disablerepo=base --disablerepo=updates --disablerepo=addons --disablerepo=extras whatprovides `cat /tmp/unowned` | cut -f1 -d' ' | sort | uniq > /tmp/missing
    yum --enablerepo=c5-media --disablerepo=base --disablerepo=updates --disablerepo=addons --disablerepo=extras install `cat /tmp/missing`

    This works by first listing all files that are not owned by any rpm package (on the first run, that’s all of them), then figuring out which packages can provide these files, and finally installing those packages.

  • Repeat the process for other important directories, like /bin, /sbin, /usr/sbin, /usr/lib, /usr/include, …
  • Clean up .rpmnew files that don’t actually contain differences:

    find / -name '*.rpmnew' | sed 's/\.rpmnew$//' > /tmp/rpmnew
    for c in `cat /tmp/rpmnew`; do echo "$c"; diff "$c" "$c.rpmnew" && mv -f "$c.rpmnew" "$c"; done

  • Same for *.rpmorig:

    find / -name '*.rpmorig' | sed 's/\.rpmorig$//' > /tmp/rpmorig
    for c in `cat /tmp/rpmorig`; do echo "$c"; diff "$c" "$c.rpmorig" && mv -f "$c.rpmorig" "$c"; done

  • Inspect the remaining ones, and merge changes.
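
If I had to do this again, I would probably wrap the two fiddly parts of the steps above in small functions. This is only a sketch: both function names are made up, the first assumes rpm prints its English "is not owned by any package" message, and the second copes with spaces in file names but not with newlines.

```shell
# Hypothetical helpers, not what I actually ran.

# Read `rpm -qf` output on stdin and print only the paths that rpm
# reports as not belonging to any package.
unowned_files() {
    sed -n 's/^file \(.*\) is not owned by any package$/\1/p'
}

# Under directory $1, look for files ending in suffix $2 (.rpmnew or .rpmorig).
# A copy that is byte-identical to the live file replaces it, as in the
# loops above; a copy that differs is printed so it can be merged by hand.
prune_identical() {
    dir=$1
    suffix=$2
    find "$dir" -type f -name "*$suffix" | while IFS= read -r backup; do
        live=${backup%"$suffix"}
        if [ -f "$live" ] && cmp -s "$live" "$backup"; then
            mv -f "$backup" "$live"
        else
            printf '%s\n' "$backup"
        fi
    done
}
```

Usage would look like rpm -qf /usr/bin/* | unowned_files > /tmp/unowned, followed later by prune_identical / .rpmnew and prune_identical / .rpmorig. Using cmp -s instead of diff keeps the output down to just the files that still need attention.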

While it’s not an experience I hope to repeat any time soon, it worked out surprisingly well!

1 Comment »

  1. I have done this. This is why my machines now have a cron job that basically does “rpm -qa” and backs up the output.

    Comment by B — 2009-6-25 @ 3:17 am
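
A cron job along those lines could be as small as the sketch below. The function name, file naming scheme and target directory are all assumptions; the one thing that matters is that the target does not live on the same disk as /var.

```shell
# Hypothetical helper for a daily cron script.
# Read a package list on stdin and save it, sorted, to
# DIR/rpm-qa-YYYY-MM-DD.txt, where DIR is the first argument.
# Prints the path of the file it wrote.
save_package_list() {
    dir=$1
    mkdir -p "$dir" || return 1
    out="$dir/rpm-qa-$(date +%Y-%m-%d).txt"
    sort > "$out" && printf '%s\n' "$out"
}
```

From cron you would call it as rpm -qa | save_package_list /some/other/disk/rpm-lists, with the path replaced by wherever your backups go.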
