I just had to vent
2004-07-20
Spent an hour trying to figure out a bug. For the sake of argument, let's say I was trying to figure out why
echo "kaka pipi" | sed 's/\S/./g'
does not substitute every letter with a dot.
Of course, every utility on UNIX systems that USES regular expressions DOES IT DIFFERENTLY.
grep uses "basic" and "extended" regexps; egrep uses extended by default, and "basic" is something that is "less powerful". What the part on regexps in grep's man page fails to mention (but the list of options has it) is that you can choose a third style, perl-style regexps. Then comes the usual difficult-to-grok explanation of grep's idea of "extended" regexps, in gory detail.
"awk" claims to be using the same regexp style as "extended" in grep, then goes on to explain, in a different way, what that means. Mere mortals have no way of verifying the initial claim. Of course awk comes with some switches that modify the regexp engine's behaviour.
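The claim can at least be spot-checked by feeding one pattern to both tools. This is only a sketch: the pattern below is made up for illustration, and one matching pattern proves nothing about the rest of the syntax. It relies on `(…)+` meaning "one or more groups" in extended syntax, while basic grep treats the parentheses and the plus as literal characters.

```shell
line="kaka pipi"
pattern='^(ka)+ (pi)+$'   # hypothetical test pattern, extended syntax

# Extended grep: the groups and + are operators, so the line matches.
ere=$(echo "$line" | grep -E -c "$pattern" || true)

# Basic grep: ( ) and + are literal here, so nothing matches.
# grep exits non-zero when there is no match, hence the "|| true".
bre=$(echo "$line" | grep -c "$pattern" || true)

# awk, handed the same pattern as a dynamic regexp.
awk_hits=$(echo "$line" | awk -v pat="$pattern" '$0 ~ pat { n++ } END { print n + 0 }')

echo "grep -E: $ere  grep: $bre  awk: $awk_hits"
```

If awk and `grep -E` agree and plain grep does not, that is at least consistent with what the man page says.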
"tr" works on sets, so the man page doesn't even use the term "regular expression", but goes on to describe a syntax of how to specify single characters or sets of characters that sort-of-resembles-regexp-style-but-not-sure-which.
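To make the "sets, not regexps" point concrete, here is a small sketch of tr's syntax. Both arguments are plain character sets that get mapped position by position; the input string is just the running example from above.

```shell
# printf instead of echo, so there is no trailing newline in the input.

# A one-character set mapped onto another one-character set.
dotted=$(printf 'kaka pipi' | tr ' ' '.')

# Character classes look like the regexp ones, but they are still just sets.
upper=$(printf 'kaka pipi' | tr '[:lower:]' '[:upper:]')

echo "$dotted"
echo "$upper"
```

The bracket syntax looks like something borrowed from regexps, but there are no anchors, no repetition and no alternation here.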
"ed" describes regular expressions, explains what it allows, and makes no reference to any other implementation. Again, impossible to tell if it's the same as any of the others or not.
The list goes on, and contains vim, emacs, and all programming languages.
So why does this matter? If you test a regexp with one application until it works, growing it genetically like we all do, everything's fine, no?
Not really.
Running
echo "kaka pipi" | sed 's/\s/./g'
gives:
- kaka.pipi on FC2
- kaka pipi on RH9
- kaka pipi on Mac OSX
- kaka.pipi on current Gentoo
Interestingly, running
echo "kaka pipi" | sed 's/\W/./g'
gives:
- kaka.pipi on FC2
- kaka.pipi on RH9 (!)
- kaka pipi on Mac OSX (!)
You can't trust sed to either make sense or be consistent across platforms. So back to square one, and this time add a note to my code that says exactly why we stick to this one piece of evolutionarily grown sed syntax - because it's the only thing that Works.
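For what it's worth, the backslash shorthands `\s`, `\S` and `\W` are not part of the POSIX regexp syntax, while bracket expressions such as `[[:space:]]` are. A minimal sketch of the same three substitutions spelled that way, assuming a sed that implements POSIX bracket expressions (it has not been tried on the exact platforms listed above):

```shell
input="kaka pipi"

# Stands in for \S: any character that is not whitespace.
not_space=$(echo "$input" | sed 's/[^[:space:]]/./g')

# Stands in for \s: any whitespace character.
space=$(echo "$input" | sed 's/[[:space:]]/./g')

# Stands in for \W: anything that is not a letter, digit or underscore.
not_word=$(echo "$input" | sed 's/[^[:alnum:]_]/./g')

echo "$not_space"
echo "$space"
echo "$not_word"
```

On a conforming sed this should print `.... ....`, then `kaka.pipi` twice. It is uglier to read than the shorthands, but it does not depend on which extensions a particular sed build happens to support.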
And next time someone complains that some new code we are writing is not consistent, and "all the UNIX tools can be consistent, why can't you", I have something to point them to.