deleting backups
2007-06-11
backups are good.
Over the years I've come to realize this simple mantra, to the point of thinking, every time I put a new service in place: "how do we back this up?"
Usually the answer is relatively simple. You svndump an svn repository then tar/bzip2 it. You dump a trac setup and tar/bzip2 it along with the config files. You rsync over maildirs (not entirely correct but good enough). You add this as a daily cron job. And so on.
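For the svn case, such a daily cron job boils down to something like this (the repository and backup paths are made up for the example):

#!/bin/sh
# dump the svn repository and compress the dump, one file per day
DATE=$(date +%Y%m%d)
svnadmin dump --quiet /var/svn/myrepo | bzip2 > /export/backup/svn/myrepo-$DATE.svn.bz2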
This isn't perfect, though. Sooner or later you will have to deal with the swaths of disk space your backups are wasting. And really, do you still need the svn dump from two years ago on Monday if you also have Sunday and Tuesday?
So really, what you want is something more like "only keep the daily backups for three months, and after that only keep a weekly backup, or one every x days".
I searched for ages among various find-like tools to make this possible from a shell script, and never found anything useful. Two weeks ago I decided I'd just write it in python, and it turns out it's a lot simpler than I feared it would be:
#!/usr/bin/python
import glob
import stat
import os
import time

files = glob.glob('*')

for file in files:
    keep = False
    s = os.stat(file)
    mtime = s[stat.ST_MTIME]

    # keep if it is from a sunday
    anyGivenSunday = time.mktime((2007, 5, 6, 0, 0, 0, 0, 0, 0))
    secondsPerDay = 24 * 60 * 60
    if (mtime - anyGivenSunday) % (7 * secondsPerDay) < secondsPerDay:
        keep = True

    # keep if it is younger than 3 months
    now = time.time()
    if now - mtime < 90 * secondsPerDay:
        keep = True

    print file, mtime, keep
This script prints out one line per file from the current directory, with True in it if the file should be kept.
So typically I run this as
keep-sundays | grep False | cut -d' ' -f 1
to see the list, and then add "xargs rm" if the list makes sense.
Next step would probably be to refine this a little, add some arguments, and put it in a cron job, but for now it solves the problem of weeding out my backups and freeing some disk space on our servers after 3 years of backups.
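For what it's worth, a crontab entry for this would look something like the following sketch (the backup directory and installed script path are made up):

# hypothetical crontab entry: prune one backup directory every night at 04:30
# -r tells xargs not to run rm at all when nothing matches
30 4 * * *   cd /export/backup/trac && /usr/local/bin/keep-sundays | grep False | cut -d' ' -f 1 | xargs -r rm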
How about http://rdiff-backup.nongnu.org/ ? It even does remote backup over ssh!
I use it in cron.daily like this:
BACKUP_SRC="/home /var/www /var/svn /var/lib/mysql-backup /etc /root"
BACKUP_DST="/export/backup/localhost"
BACKUP_CMD="/usr/bin/rdiff-backup --verbosity 1"

for SRC in $BACKUP_SRC ; do
    DST="${BACKUP_DST}${SRC}"
    if [ ! -d ${DST} ] ; then
        echo "Ai, DST (${DST}) does not exist. (creating it)."
        mkdir -p ${DST}
    fi
    echo "-- Backing up ${SRC}.."
    $BACKUP_CMD $SRC $DST
    $BACKUP_CMD --remove-older-than 14D --force $DST
done
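The remote case just uses rdiff-backup's host::path syntax for the destination, something like (hypothetical host and path):

/usr/bin/rdiff-backup /home backupuser@backuphost::/export/backup/myhost/home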
Comment by ccsalvesen — 2007-06-11 @ 14:53
I use rsnapshot (www.rsnapshot.org) for exactly that purpose.
You can rsync+ssh remote sites and you can use custom scripts, rsnapshot manages the daily/weekly/monthly rotation of your backups. It uses hard links to files that haven’t been changed since your last backup to save diskspace.
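A minimal rsnapshot.conf for that rotation looks something like this sketch (from memory, paths made up; note that rsnapshot wants tabs between the fields):

snapshot_root   /srv/rsnapshot/
interval        daily   7
interval        weekly  4
interval        monthly 6
backup          /home/          localhost/

with cron running "rsnapshot daily", "rsnapshot weekly" and "rsnapshot monthly" at the appropriate times.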
I’ve been using it for several years now and it’s a lifesaver! You should definitely check it out if you haven’t already :)
Comment by Dennis Krul — 2007-06-11 @ 15:15
Have you checked faubackup? It does what you want, and in addition, hardlinks all the files that haven’t changed between the snapshots. To give an example of a real case, I’ve got one machine backing up >50GB daily, but only around 100MB of it actually gets added or changed, so only that 100MB is not hard linked.
Basically it makes keeping snapshots for the last 7 days, one weekly for the last 4 weeks, and one monthly a breeze. And without wasting huge swathes of disk space.
Comment by nona — 2007-06-11 @ 15:21
@ccsalvesen: it looks like yours will just delete everything older than 14 days. Any way to make it delete most of the old archives, but not all, the way my script does?
@dennis and nona: I have used dirvish in the past (mostly at home), and it served me well, hardlinking as well. Though the directories become slow to traverse. But in this particular case, I want to manage backups that basically are one .tar.bz2 of a whole directory and I need to be able to go back in time and unpack/inspect them. Dirvish serves a different purpose, and it looks like rsnapshot and faubackup also operate on trees, and don’t manage rotation of a single archive/file over time. Am I missing something?
Comment by Thomas — 2007-06-11 @ 15:44
Short and sweet, but make sure you’re not backing up any files with “False” in their name.
Comment by rephorm — 2007-06-11 @ 19:43
You need to learn how to use the datetime module:
import datetime
isSunday = datetime.date.today().weekday() == 6
The 3 months thingie is easiest to do using the external dateutil package.
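Applied to a file's mtime, that would be something like this sketch (using a plain 90-day timedelta rather than dateutil):

import datetime
import os

filename = 'backup.tar.bz2'  # placeholder; the original script loops over glob results

mtime = datetime.date.fromtimestamp(os.stat(filename).st_mtime)
isSunday = mtime.weekday() == 6  # Monday is 0, Sunday is 6
isRecent = datetime.date.today() - mtime < datetime.timedelta(days=90)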
Comment by Johan — 2007-06-11 @ 21:45
@Thomas: you’re right. I missed your point.
Take two:
Every day, back up to current.tbz2. Then do:
cp current.tbz2 $(date '+%Y').tbz2
cp current.tbz2 $(date '+%m').tbz2
cp current.tbz2 $(date '+%d').tbz2
You’ll keep one for each of the last 30-something days, one for each of the last twelve months and one for each year.
Not that much waste, automagic overwriting and dead simple. Woohoo.
Comment by ccsalvesen — 2007-06-11 @ 22:30
And a better way:
keep-sundays | awk '/False/{print $1}'
Comment by Jeff Schroeder — 2007-06-12 @ 05:47
@thomas: Just use the backup_script command like this:
# backup_script [command] [destination subdirectory]
backup_script "/usr/bin/tar cfjv backup.tar.bz2 /home/thomas" home_thomas/
The backup script is executed with the destination directory as the current working directory. Rsnapshot takes care of the rotation and cleaning up old backups.
Your directory layout will be something like this:
/srv/rsnapshot/daily.0/home_thomas/backup.tar.bz2
/srv/rsnapshot/daily.1/home_thomas/backup.tar.bz2
/srv/rsnapshot/weekly.0/home_thomas/backup.tar.bz2
/srv/rsnapshot/monthly.0/home_thomas/backup.tar.bz2
You get the idea.
Comment by Dennis Krul — 2007-06-12 @ 10:50