Present Perfect




Filed under: Fluendo,Hacking — Thomas @ 10:47 pm


So, our server has been tested in production a bunch of times. Each time, it runs fine for about fifteen minutes, with clients connecting to it the whole time. It serves about 500 streams without any issues, using only about 1% CPU in total. At some random point, however, it drops clients and hangs; it looks like it’s hanging in a read, but the stack trace seems corrupted.

So the hard thing about this kind of problem is that we cannot trigger it in a local setup where we simulate clients in a dumb way (1000 wget processes, for example), and that on the server it’s hard to get usable debug info. The log file with only DEBUG logging from one element on the server is about 600 MB by the time this problem happens. And really, 1000 wgets are not a good simulation of 1000 real users, each with their own network speed, and each with their own “Reload” push frequency.

I’ve searched the net for network stress test tools, but haven’t found anything I can use yet. All of the web stress test tools use the complete request as a testing primitive. Meaning, a successful request is one where you get a complete reply, a full page, and a normal end. Of course our streams are “infinite”, so we cannot use these test apps.

Other network testing tools work more low-level, which would mean we’d have to write some TCP/HTTP handling code as well. Really, what we’d need is some tool that allows us to get a URL, specify the client’s bandwidth and possibly bandwidth profile, and keep connections alive for a random amount of time. If you know of anything, let me know.
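To make that wish list a bit more concrete, here is a rough sketch in Python of what such a client could look like. It is not an existing tool, and all the names in it (`throttled_read`, `simulate_client`, `run_clients`) are made up for illustration. The idea is simply to read from the stream no faster than a given rate, so that TCP flow control makes the server see a slow client, and to hang up after a random amount of time.

```python
import io
import random
import threading
import time
import urllib.request


def throttled_read(stream, bytes_per_sec, duration, chunk_size=4096,
                   clock=time.monotonic, sleep=time.sleep):
    """Read from `stream` at roughly `bytes_per_sec` for `duration` seconds.

    Returns the number of bytes read. Stops early if the stream ends.
    `clock` and `sleep` are injectable so the pacing can be tested offline.
    """
    start = clock()
    total = 0
    while clock() - start < duration:
        data = stream.read(chunk_size)
        if not data:
            break  # the server closed the connection or the stream ended
        total += len(data)
        # Sleep until the average rate drops back to the target rate.
        delay = total / bytes_per_sec - (clock() - start)
        if delay > 0:
            sleep(delay)
    return total


def simulate_client(url, bytes_per_sec, min_secs, max_secs):
    """One fake viewer: connect, read slowly, leave after a random time."""
    duration = random.uniform(min_secs, max_secs)
    with urllib.request.urlopen(url) as response:
        return throttled_read(response, bytes_per_sec, duration)


def run_clients(url, count, bytes_per_sec=16000, min_secs=5, max_secs=60):
    """Start `count` fake viewers in threads and wait for all of them."""
    threads = [
        threading.Thread(target=simulate_client,
                         args=(url, bytes_per_sec, min_secs, max_secs))
        for _ in range(count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Something like `run_clients("http://localhost:8800/stream", 500)` (URL invented here) would then give 500 slow readers with random lifetimes. It still doesn’t model bandwidth profiles or people hammering “Reload”, and a thread per client won’t scale forever, but it is closer to real users than a pile of wgets.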

Anyway, I started reading about possible limits for file descriptors and so on, and learned a bunch of useful new stuff. Then I started theorizing about possible failure scenarios from what I had learnt, and then I went through our plugin code again to see if these cases could be triggered. I also thought about how I could test each of these cases.
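As an illustration of the kind of limit I mean (and not the actual bug), the per-process limit on open file descriptors can be inspected from Python with the standard `resource` module. This is only a sketch for Unix-like systems; the helper names are made up, and counting open descriptors through `/dev/fd` is an assumption that holds on Linux and macOS but not everywhere.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process.

    Assumes /dev/fd exists (Linux, macOS); the listing itself briefly
    uses one descriptor, so the result can be off by one.
    """
    return len(os.listdir("/dev/fd"))


def raise_soft_limit(wanted):
    """Raise the soft limit towards `wanted`, never above the hard limit.

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if wanted <= soft:
        return soft  # never lower the limit here
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted
```

With a default soft limit of 1024, a server with a descriptor per client plus its own files and pipes runs out well before 1000 clients, so comparing `open_fd_count()` against `nofile_limits()` while clients pile up is a cheap first check.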

The actual bug seems to be a really silly oversight in handling some error cases, but the good thing is I got about ten different points to watch out for and how I could reproduce, test and fix. I can hardly wait to get to work tomorrow to start doing all these tests, because something tells me this will fix our problem and give us a rock solid server. Or, at least, one that runs for more than 15 minutes when faced with a lot of clients :)
