Friday, February 14, 2014

Outer Join Not Behaving As Expected

This article was inspired by the "left outer join not working as expected" conversation on the SQL Anywhere forum...

The Question

How do I show the total bonus amount paid to each employee hired in 2013 or later?

The Data

CREATE TABLE employee (
   employee_id    INTEGER NOT NULL PRIMARY KEY,
   hire_date      DATE NOT NULL );

CREATE TABLE bonus ( 
   employee_id    INTEGER NOT NULL REFERENCES employee ( employee_id ),    
   bonus_date     DATE NOT NULL,
   bonus_amount   DECIMAL ( 9, 2 ) NOT NULL,
   PRIMARY KEY ( employee_id, bonus_date ) );

INSERT employee VALUES ( 1, '2011-02-01' );
INSERT employee VALUES ( 2, '2012-02-01' );
INSERT employee VALUES ( 3, '2013-02-01' );
INSERT employee VALUES ( 4, '2014-02-01' );
INSERT bonus VALUES ( 1, '2011-12-01', 100.00 );
INSERT bonus VALUES ( 1, '2012-12-01', 100.00 );
INSERT bonus VALUES ( 1, '2013-12-01', 100.00 );
INSERT bonus VALUES ( 2, '2012-12-01', 100.00 );
INSERT bonus VALUES ( 2, '2013-12-01', 100.00 );
INSERT bonus VALUES ( 3, '2013-12-01', 100.00 );
COMMIT;

The Expected Output

employee_id hire_date          total_bonus 
----------- ---------- ------------------- 
          2 2012-02-01              200.00 
          3 2013-02-01              100.00 
          4 2014-02-01                0.00 

The Answer

  • Employee 4 doesn't have any bonus rows at all so a LEFT OUTER JOIN is required,

  • plus a SUM and GROUP BY to calculate total_bonus

  • and a COALESCE to turn Employee 4's NULL amount into zero

  • plus the predicate employee.hire_date >= '2012-01-01' to eliminate Employee 1:
SELECT employee.employee_id,
       employee.hire_date,
       COALESCE ( SUM ( bonus.bonus_amount ), 0.00 ) AS total_bonus
  FROM employee LEFT OUTER JOIN bonus
       ON employee.employee_id = bonus.employee_id
       AND employee.hire_date >= '2012-01-01'
 GROUP BY employee.employee_id,
       employee.hire_date
 ORDER BY employee.hire_date;

Thank you for playing, next contestant please!

That might be the expected output, but here's the actual output from "The Answer" above:
employee_id hire_date          total_bonus 
----------- ---------- ------------------- 
          1 2011-02-01                0.00 
          2 2012-02-01              200.00 
          3 2013-02-01              100.00 
          4 2014-02-01                0.00 
Employee 1 is included in the output in spite of the predicate employee.hire_date >= '2012-01-01'

Why is that?

There's a clue in the Help:
A left or right outer join of two tables preserves all the rows in one table, and supplies nulls for the other table when it does not meet the join condition.
Employee 1 does not meet the join condition (because employee.hire_date >= '2012-01-01' is FALSE) so the row is preserved and NULL is supplied for the corresponding bonus.bonus_amount, with the end result being a zero-amount row for Employee 1 is included in the output.

That IS the expected result for the query shown above... not the expected result for ordinary folk (like me, and perhaps like you as well), but the expected result for whoever invented outer joins.

The Help actually contains what appears to be contrary advice:
Outer joins and join conditions

A common mistake with outer joins is the placement of the join condition. If you place restrictions on the null-supplying table in a WHERE clause, the join is usually equivalent to an inner join.

The reason for this is that most search conditions cannot evaluate to TRUE when any of their inputs are NULL. The WHERE clause restriction on the null-supplying table compares values to NULL, resulting in the elimination of the row from the result set. The rows in the preserved table are not preserved and so the join is an inner join.

The exception to this is comparisons that can evaluate to true when any of their inputs are NULL. These include IS NULL, IS UNKNOWN, IS FALSE, IS NOT TRUE, and expressions involving ISNULL or COALESCE.

For example, the following statement computes a left outer join.
SELECT *
FROM Customers KEY LEFT OUTER JOIN SalesOrders
   ON SalesOrders.OrderDate < '2000-01-03';
In contrast, the following statement creates an inner join.
SELECT Surname, OrderDate
FROM Customers KEY LEFT OUTER JOIN SalesOrders
   WHERE SalesOrders.OrderDate < '2000-01-03';
The first of these two statements can be thought of as follows: First, left-outer join the Customers table to the SalesOrders table. The result set includes every row in the Customers table. For those customers who have no orders before January 3 2000, fill the sales order fields with nulls.

In the second statement, first left-outer join Customers and SalesOrders. The result set includes every row in the Customers table. For those customers who have no orders, fill the sales order fields with nulls. Next, apply the WHERE condition by selecting only those rows in which the customer has placed an order since January 3 2000. For those customers who have not placed orders, these values are NULL. Comparing any value to NULL evaluates to UNKNOWN. So, these rows are eliminated and the statement reduces to an inner join.

Now that is excellent advice, but it applies to predicates involving the table on the nullable side of an outer join; e.g., the table on the right side of a LEFT OUTER JOIN. In this case the "clue" applies: employee.hire_date >= '2012-01-01' involves the preserved side of the outer join (i.e., the table on the left side of the LEFT OUTER JOIN) so the predicate is effectively ignored for Employee 1...

... and that's what doesn't make sense to ordinary folk.

One solution is to move the predicate into the WHERE clause. Even though that seems to violate the "common mistake" exhortation above, it really doesn't... and it works by excluding Employee 1:
SELECT employee.employee_id,
       employee.hire_date,
       COALESCE ( SUM ( bonus.bonus_amount ), 0.00 ) AS total_bonus
  FROM employee LEFT OUTER JOIN bonus
       ON employee.employee_id = bonus.employee_id
 WHERE employee.hire_date >= '2012-01-01'
 GROUP BY employee.employee_id,
       employee.hire_date
 ORDER BY employee.hire_date;

employee_id hire_date          total_bonus 
----------- ---------- ------------------- 
          2 2012-02-01              200.00 
          3 2013-02-01              100.00 
          4 2014-02-01                0.00 
There's a better way, however, one that doesn't require you to open the textbook every time you want to code an outer join:

Use derived tables to divide and conquer

A derived table is like a local view. Instead of coding a separate CREATE VIEW or even using a WITH clause, you can code a SELECT right inside the FROM clause. And in this case, that inner SELECT can apply the predicate employee.hire_date >= '2012-01-01' before the LEFT OUTER JOIN is even executed:
SELECT employee.employee_id,
       employee.hire_date,
       COALESCE ( SUM ( bonus.bonus_amount ), 0.00 ) AS total_bonus
  FROM ( SELECT *
           FROM employee 
          WHERE employee.hire_date >= '2012-01-01' 
       ) AS employee
       LEFT OUTER JOIN bonus
       ON employee.employee_id = bonus.employee_id
 GROUP BY employee.employee_id,
       employee.hire_date
 ORDER BY employee.hire_date;

employee_id hire_date          total_bonus 
----------- ---------- ------------------- 
          2 2012-02-01              200.00 
          3 2013-02-01              100.00 
          4 2014-02-01                0.00 
Two tricks make derived tables easier to work with: First, use SELECT * instead of worrying about a column list, and second, use the original table name as the alias name: AS employee. In this example, the only code that changed was this:
  • The table name "employee" in the FROM clause was changed to "( SELECT ... ) AS employee"

  • and the WHERE clause was eliminated.
That might not seem like much for a trivial query like this, but it helps a lot for giant complex queries... including ones that are having INNER JOINS retrofitted with LEFT OUTER JOINS.

No comments: