Monday, June 16, 2008

IS NULL versus NOT EXISTS

Here is the pseudo-SQL for a fairly common query:

SELECT * 
FROM t1
WHERE there is no matching row in t2;
The "matching" relationship between tables t1 and t2 is probably a parent-child foreign key relationship like this:
CREATE TABLE t1 (
pk INTEGER NOT NULL PRIMARY KEY );

CREATE TABLE t2 (
pk INTEGER NOT NULL PRIMARY KEY,
fk INTEGER NOT NULL REFERENCES t1 ( pk ) );
One way to code the query uses NOT EXISTS to find which parent rows in t1 don't have any child rows in t2:
SELECT *
FROM t1
WHERE NOT EXISTS
( SELECT *
FROM t2
WHERE t2.fk = t1.pk )
ORDER BY t1.pk;
In other words, "SELECT each row in t1 WHERE NOT EXISTS ( a match between that row in t1 and any row in t2 )"

To me, that's easy to think of, easy to write, easy to read and easy to explain. SQL is a pretty funky language, but WHERE NOT EXISTS is far from the worst it has to offer.

There is another method that does move things further out on the Funky SQL Scale: LEFT OUTER JOIN and IS NULL...
SELECT t1.*
FROM t1 LEFT OUTER JOIN t2
ON t1.pk = t2.fk
WHERE t2.fk IS NULL
ORDER BY t1.pk;
Outer joins give lots of people lots of trouble, why would someone use LEFT OUTER JOIN if there was a simpler alternative?

And just how does it work, anyway?

Dealing with the second question first, the LEFT OUTER JOIN between t1 and t2 guarantees that each row in t1 that does not have matching counterpart in t2 will still be represented in the candidate result set. Also, that candidate row will have NULL values for all the t2 columns. So, when the clause WHERE t2.fk IS NULL is applied, the final result set is limited to just the rows where there is no match in t2.

For rows in t1 that do have matching rows in t2, the t2.fk column will not be NULL, so they are excluded.

Magic! But still... if you stumble upon that query in an application you might be hard pressed to figure out what it's for. Self-documenting code, it's not.

The answer to the first question, "why write it that way?", is the same as the answer to this question:
What is the most common reason people write incomprehensible code?
Answer: Performance! The IS NULL version might run a bit faster than NOT EXISTS!

Here is a diabolical example: big tables, lots of rows, COUNT(*), cold cache that's too small to begin with...
SELECT COUNT(*)
FROM enwiki_text
WHERE NOT EXISTS
( SELECT *
FROM enwiki_entry
WHERE enwiki_entry.from_line_number
= enwiki_text.line_number );
COUNT()
225,348,118
Execution time: 627 seconds

SELECT COUNT(*)
FROM enwiki_text LEFT OUTER JOIN enwiki_entry
ON enwiki_text.line_number
= enwiki_entry.from_line_number
WHERE enwiki_entry.from_line_number IS NULL;

COUNT()
225,348,118
Execution time: 537 seconds
Here's what the two plans look like, NOT EXISTS on the left and IS NULL on the right:


The parent table enwiki_text takes up 20.6G of space and contains 231,900,608 rows, while the child enwiki_entry "only" uses 17.2G to hold 6,552,490 rows:
CREATE TABLE enwiki_text ( 
line_number BIGINT NOT NULL DEFAULT autoincrement,
line_text LONG VARCHAR NOT NULL DEFAULT '',
CONSTRAINT ASA76 PRIMARY KEY CLUSTERED (
line_number ) );

CREATE TABLE enwiki_entry (
page_number BIGINT NOT NULL,
from_line_number BIGINT NOT NULL,
to_line_number BIGINT NOT NULL,
page_title VARCHAR ( 1000 ) NOT NULL,
page_id VARCHAR ( 100 ) NOT NULL,
page_text LONG VARCHAR NOT NULL,
CONSTRAINT ASA80 PRIMARY KEY CLUSTERED (
page_number ) );

ALTER TABLE DBA.enwiki_entry
ADD CONSTRAINT fk_from NOT NULL FOREIGN KEY (
from_line_number )
REFERENCES DBA.enwiki_text (
line_number )
ON UPDATE RESTRICT ON DELETE RESTRICT;
Conclusion? I still prefer NOT EXISTS, but I understand why folks might use LEFT OUTER JOIN and IS NULL... if it helps, and if it matters (it often doesn't).

7 comments:

Anonymous said...

I prefer WHERE NOT EXISTS, too, for readability.
But what about the EXCEPT statement?

How about something like
SELECT t1.* FROM t1
EXCEPT
SELECT t1.* FROM t1
WHERE pk in (SELECT fk FROM t2) ORDER BY t1.pk;

(I do not know about the performance, but in a way the EXCEPT statement seems logically natural for a set-based exclusion).

Just me thoughts

Volker

Anonymous said...

hey how to get number of rows from sqlanywhere. when i am using sqlanywhere_num_rows it is giving me -1 and there is 1 row in the table.What to do???

Breck Carter said...

As of today (December 22, 2008) this is the Number One most popular page on this blog by a wide margin... go figure!

Anonymous said...

I googled "sql example not exist" and it was the number 5 result.

John McCormick said...

It's a very clear and in-depth explanation of a pretty common situation, so I wouldn't be too surprised that it's risen to the top. :)

I ran into it while looking for data on which of these two was best for performance, though my results with ~10^(4-5) row tables under SQL Server 2005 actually point the other way. In any case, the difference is very small.

Thanks for your time putting it together!

kmdavisjr said...

I use outer Join and Null in the child table. Seems like I always get better performance that way, although not exists is easier to read...

Justin said...

It seems that the count(*) in enwiki your example might force the optimizer to optimize for all rows. Do you have any suggestions for optimizing the 'not exists' situation for first rows? I'm doing a "find all unprocessed rows" (where processing creates a row in my child table).