Bug hunting |
2009-04-05
|
Yesterday, I was working some more on my jukebox code. It is now in the state where I can actually have it play a playlist and do the mixing properly. So I'm finally into dog food mode if you will, or in the mode where I can actually enjoy the result of the code I've written.
But I still had random failures that I couldn't make sense of that I thought were due to gnonlin. I was getting random backtraces on startup when setting up my gnl objects, like so:
Traceback (most recent call last):
File "examples/jukebox.py", line 123, in
main()
File "examples/jukebox.py", line 115, in main
main = Main(loop, tracks, playlist)
File "examples/jukebox.py", line 67, in __init__
self._jukebox.add_track(path, self._tracks[path][0])
File "/home/thomas/svn/src/dad/trunk/dad/gstreamer/jukebox.py", line 81, in add_track
self._process()
File "/home/thomas/svn/src/dad/trunk/dad/gstreamer/jukebox.py", line 137, in _process
operation.props.duration = mix.duration
TypeError: could not convert argument to correct param type
I had been noticing them before but always thought "I'll get to that later." But yesterday it was getting on my nerves.
The problem was, there was nothing wrong with the duration I was setting. 'duration' is an INT64 property on a gnloperation, and the duration I was setting was a normal long. I spent hours sprinkling debugging code and diving deeper into the stack to figure out where it was going wrong, but couldn't turn up anything.
I did see that in pygobject's gobject/pygtype.c code, it was checking PyErr_Occurred, and just resetting the error and the GValue. I added a PyErr_Print to see why the conversion was failing, and it told me:
TypeError: can't convert negative long to unsigned
This made even less sense - I wasn't setting a negative long, and I wasn't setting it to an unsigned property either. It's too bad the pygtype code wasn't bubbling up this clear error up, instead obfuscating it with the generic error, but it still didn't explain my bug.
We had invited friends over for dinner, so in the afternoon I was forced to take my mind off the keyboard, but as you all know that doesn't mean your mind is off the problem. I'm sure you all know that feeling where you have lots of different pieces of the puzzle lying in front of your mind's eye, and none of your theories made sense and you still see a bunch of different pieces.
Dinner was prepared, friends came, a good time was had by all, and my mind was off the problem. My only distraction was listening to the mixing points at the end of each song. And then when they left around 1 AM, Kristien and I sat on the couch, were watching some TV, it suddenly hit me. A classic bug pitfall.
What if the error was actually triggered before the line of code I was looking at ?
I'm sure you also all know that feeling where suddenly, a different way of looking at the problem makes all the pieces of the puzzle fall into place. It's probably the most gratifying moment when doing bug hunting - after hours of frustrating undirected prodding, suddenly you get that one theory that explains both the problem and the lack of result in the prodding.
That's when, instead of flying to the computer and trying it out, I prefer to savor the moment a bit. Because there's still the possibility that I might be wrong, and just uncover another layer of fur on the yak you're shaving. But no, everything seemed to add up, and I knew in my mind exactly the one line of code I needed to add to either prove or disprove your theory. And I also knew that this one line of code was the last thing I needed to try that day. Nothing else. Before, every 'just 2 more minutes' I asked for where valiant efforts borne from bug hunting optimism. Now, I see the problem crystal clear.
If the one line proves the theory, I can go to bed safely, be happy that I found the problem, that my stack ws unwound, and I was holding my juggling balls of my mental state in my hands. (Of course, if it disproved the theory, I was in for a night of bad sleep, going over the problem again and again in my head).
So, while Kristien got ready for bed, I moved back to the keyboard, and before the call that I thought was failing in the bindings, added another PyErr_Occurred check, and ran it. And there it went - and it confirmed that the error was actually generated before the line I was looking at, and coming from somewhere else.
So, off to bed, knowing that in the morning I knew exactly where to look.
After some futzing around, I realized that the call before was setting a negative value on a UINT64 property, and it was not erroring out. I was able to reproduce it from a python shell:
>>> import gst
>>> e = gst.element_factory_make('gnlsource')
>>> e.props.start = -1L
>>> import gobject
Traceback (most recent call last):
File "
TypeError: can't convert negative long to unsigned
And after digging through the history of that code, I found the revision where a change was made that removed the error checking for this particular case.
I filed a bug, and that was it. Hours of frustration boiling down to a simple commit labeled 'guint64 property fix'. The patch is trivial, so I'm sure they will do a better job at it than I would.
Of course, the actual bug is still in my code - I need to figure out why I'm setting a negative start value.
But again I am reminded of a fundamental bug fixing rule that I broke again this time: before trying to fix the bug where it appears to be, first make sure that the bug is where it appears to be. Sometimes the bug is created before where it is detected, and it's a lot easier to verify that first before anything else.
Kinda makes you wish python was a statically-typed language ;-)
Comment by anon — 2009-04-05 @ 13:55