Tuesday, March 4, 2014

Case Study: Intra-Query Parallelism

There have been several articles in this blog on the subject of Intra-Query Parallelism (using more than one CPU to handle a single request simultaneously) in SQL Anywhere... there has even been a fRiDaY File on the subject.

This article is about a real-world client experience with

  • unusually high CPU peaks accompanied by reports of

  • "very, very slow" application performance; e.g., 2 minutes to login instead of fractions of a second
using a SQL Anywhere database.

The server-level performance statistics displayed by Foxhound did not reveal any clues, nor did a cursory examination of the connection-level statistics... no obvious runaway connections, no apparent "Queries From Hell".

A more boring, laborious thorough examination of connection history revealed that from time to time, some (not all) of the peaks were accompanied by the appearance of exactly four internal or "temporary" connections; no connection names or other identifying data, just four large consecutive connection numbers like 1002046681, 1002046682, 1002046683 and 1002046684.

The "Temporary connections are named" enhancement in SQL Anywhere 12.0.0 introduced CONNECTION_PROPERTY ( 'Name' ) values like 'INT:Exchange' to help identify what each internal connection is doing, plus the 'ParentConnection' property to help identify the connection that started each internal connection. However, those properties weren't available in the SQL Anywere database being studied, so Foxhound could only show these connections as nameless standalone entries in the connection section at the bottom:

  • The top line at "1:29:21 PM" shows a CPU Time peak of 50.8%, and the bottom section shows four connections that were active when the top sample was recorded.

  • The "Conns 57" number in the top section disagrees with the "61 connections" title in the bottom section because the database-level ConnCount property doesn't include temporary connections.

  • The bottom section was sorted by "Conn #" in decreasing order to force the large-numbered internal connections to the top.

  • The "CPU %" values in the bottom section are wildly exaggerated (they add up to more than 100%) even though Foxhound has adjusted the numbers to account for the fact 4 CPUs were in use. The next version of Foxhound will do a better job at showing believable numbers (the problem lies with SQL Anywhere) . . . but it doesn't matter here . . . what matters is these four connections were sucking all the air out of the room.

  • Experience showed that intra-query parallelism was the likely reason these four connections exist, which meant that somewhere among the other connections there was a parent responsible for all the activity... the fact that the number of internal connections always exactly matched the number of CPUs was a helpful clue.

  • Not shown in the image above is the fact that finding the parent connection among 57 others was virtually impossible because the "CPU %" column offered no clue; the peaks only showed up for the internal (child) connections, not the parent that was living off child labor managing the work effort.

  • Also not shown in the image above is the fact that no "Last Statement" values showed up in the bottom section because "RememberLastStatement" was not turned on, so even if the parent connection could have been determined there was no way to tell what it was doing.
Nevertheless, experience (also) showed that intra-query parallelism was the likely culprit, and that doing this
would make make it all better.

And verily forsooth, such was the case!

That was the good news (for the client)... the bad news (for the consultant) was that no more work was required. Perhaps further study would have revealed a Quey From Hell or two, and even more billable hours that could be spent on query tuning, but that would have violated The First Rule Of Program Optimization: Don't do it.

Or in other words, if it's running OK, leave it alone.

Or, if it ain't broke, don't fix it.

Not something you would ever hear from Marty Kaan :)


Justin Willey said...

Do you see the same sort of problem in v12 & v16?

If so, is there an argument for turning off intra-query parallelism for databases with significant numbers of connections as default? (At least until whatever is causing these problems is fixed - if it is a bug of course).

Justin Willey said...

What I mean is not that intra-query parallelism is in itself bad, but it looks like it can be very greedy; which is bad news if you are trying to provide a decent level of response to all connections.

Breck Carter said...

@Justin: TL;DR alert!

In hindsight, the well-deserved bad reputation gained by intra-query parallelism over the years is a good, solid argument for making the default "off" back when SQL Anywhere Version 10 was first released. Of course, because "Watcom Does Things The Way They Should Be Done" is ALSO a well-deserved compliment, SET max_query_tasks = '1' as the the default would have guaranteed very few clients would have used it... and that would have reduced the effectiveness of the valuable "User Test" (you know, that exciting period immediately following "Beta Test").

In other words, it would have taken much longer to get the bugs out.

Here's an analogy that might help describe the current situation (not prove anything, just describe): intra-query parallelism is like thalidomide... both were promoted as great advances, and with both the bad experiences of early adopters now makes it difficult to convince some folks to use it, even when it is clearly beneficial in many applications.

No, I don't tell anyone "turn intra-query parallelism off by default", not even Version 10 users. But, I do tell folks "turn it off at the first sign of trouble". I also tell folks that once they have turned it off, do not turn it on again without proof that it makes a needed improvement, with "needed" and "improvement" two different concepts (performance improvements are not always necessary, they may be too small to be noticed, the users may be perfectly happy, etcetera).

I am guessing that it's hard to prove intra-query parallelism helps an actual production workload... I've never seen it done. Proving it hurts is trivially easy.

Is intra-query parallelism better in versions 12 and 16? I have no idea. Seriously... I am the wrong person to ask :) [insert offensive thalidomide joke here]

Jeff Albion said...

>> (At least until whatever is causing these problems is fixed - if it is a bug of course).

Notably, these problems are already solved:


of which at least one known bug still exists in 10.x and will not be fixed.

Breck Carter said...

@Jeff: For what it's worth, Justin Willey is NOT the client mentioned in the article, and the performance problem described in the article had NOTHING to do with the "connections that cannot be dropped" problem.

Jeff Albion said...

Yes, agreed. I was pointing out general existing intra-query parallelism issues on 10.x, for stability purposes (I read something about thalidomide...? :).

For performance issues specifically, I'd also have to point you to other fixes, like CR #545528. Specifically, for an issue where we were inappropriately using intra-query parallelism ("too greedy") and has been fixed, see: CR #733181.

Breck Carter said...

CR 733181 fixes an "All threads are blocked" condition which is hardly a "performance issue" unless you regard thrash-and-splat a "performance issue"... I don't, I call it a bug :)

CR 545528 was applied over 5 years ago ( 01 Oct 2008, 15 Dec 2008). The description says it only applied to "inexpensive statements", but that doesn't matter; what matters is that it didn't help the database discussed above.

Everyone wants intra-query parallelism to be successful! In the meantime, the wise hope for the best and prepare for the worst.