Friday, December 31, 2010

What "Hopelessly Overloaded" Looks Like

One of my favorite Foxhound regression tests is the one I call "the hopelessly overloaded database server". That's when I throw so much work at a SQL Anywhere database that ALL the lights on the server come on, ALL the fans start running and almost ALL the load-related Foxhound alerts appear in my inbox.

Yesterday, however, I broke an important programming rule: I looked to see what was happening while walking out the door at the end of the day. NEVER do that, not if you actually want to get out the door in a good mood.

But alas, it was a shoulder-slumping moment: Instead of recording a sample every 10 seconds like it's supposed to, Foxhound was taking OVER TWO MINUTES to record what the database server was doing.

And worse: It was issuing an Alert #1 "Database unresponsive" message, followed by an All Clear, in pairs, every two minutes. A previous test using an earlier build of Foxhound exhibited no such behavior.

Something must be running slower... much slower... inside Foxhound.

A bug which needs fixing... this is the reason we run regression tests, right?

[fretful sighing sound]

Ha HA! NOT a bug!

It's a behavior change!

The difference was not the two minutes to record a sample; the old version of Foxhound took the same amount of time.

The difference was the annoying Alert / All Clear messages appearing with every single sample.

OK, it's a bug... to be dealt with later. In the meantime, here's the new Foxhound FAQ with a screen shot showing what "hopelessly overloaded" looks like...

Question: Why is Alert #1 - Database unresponsive issued and cleared with every sample gathered?


Here's a quick workaround to stop Foxhound from issuing so many Alert #1 - Database unresponsive messages...
Use the Alerts Criteria Page to increase the threshold for Alert #1 - Database unresponsive from 1 minute to some value larger than the time it takes the Foxhound Monitor to gather a sample (the "Interval" time shown on the Monitor page).
Here's the long answer...
When the target database server is heavily overloaded, Foxhound may take longer than one minute (or whatever the threshold is for Alert #1 - Database unresponsive) to gather a sample. In that case Foxhound will issue Alert #1 while it's waiting for the sample data to be returned, and then immediately issue an All Clear when it does get the data.

This is new behavior for Foxhound. Previously, Alert #1 messages were only issued if Foxhound failed to gather a successful sample. In this case, however, Foxhound isn't actually failing to gather samples; it's just taking a long time.

The change was made because it is important for you to know when your server is hopelessly overloaded as well as when it is completely unavailable. However, the repeated Alert / All Clear messages are annoying, and something will probably be done about that in the future.
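
To make the timing concrete, here is a rough Python sketch of the behavior described in the long answer. It is not Foxhound code: the function name, the Event class and the sample durations are all made up for illustration, and it assumes each sample starts as soon as the previous one finishes. It just shows why a 1 minute threshold produces an Alert / All Clear pair for every slow sample, and why raising the threshold above the sample time makes the pairs go away.

```python
from dataclasses import dataclass


@dataclass
class Event:
    at_seconds: float  # seconds since monitoring started
    message: str


def simulate_alerts(sample_durations, threshold_seconds=60.0):
    """Return the Alert #1 / All Clear events raised while gathering samples.

    Hypothetical model, not Foxhound's implementation: samples are assumed
    to run back to back, and an alert is raised as soon as the wait for a
    sample exceeds the threshold.
    """
    events = []
    clock = 0.0
    for duration in sample_durations:
        if duration > threshold_seconds:
            # Still waiting for sample data when the threshold expires.
            events.append(Event(clock + threshold_seconds,
                                "Alert #1 - Database unresponsive"))
            # The data finally arrives, so the alert is cleared right away.
            events.append(Event(clock + duration, "All Clear"))
        clock += duration
    return events


# Made-up sample times of a bit more than two minutes each.
slow_samples = [130.0, 125.0, 140.0]

for event in simulate_alerts(slow_samples, threshold_seconds=60.0):
    print(f"{event.at_seconds:6.0f}s  {event.message}")

# The workaround: a threshold larger than the slowest sample time.
print(len(simulate_alerts(slow_samples, threshold_seconds=180.0)))
```

With the 60 second threshold every one of the three slow samples yields one alert and one all clear; with the 180 second threshold the list of events is empty.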

The following image shows what "hopelessly overloaded" looks like. The target database server is using 64% of a four-core CPU, but the server computer is actually running at 100% CPU... it's also running a multi-threaded client application with 1,003 database connections performing 7,400 database update transactions per second. The client application wants to do more, but everything is maxed out, and instead of recording a sample every 10 seconds, Foxhound is taking more than two minutes for each one. To make matters worse, Foxhound is also running on the server computer; in this case, the first step should be to move Foxhound and the client application to some other computer(s).

See also... The Alerts Criteria Page
