Using a simple moving average to "smooth out" data is a fairly popular technique... it's too bad the primary example in the SQL Anywhere Help is far from simple:
SELECT * FROM ( SELECT s.ProductID, Month( o.OrderDate ) AS julian_month, SUM( s.Quantity ) AS sales, AVG( SUM( s.Quantity ) ) OVER ( PARTITION BY s.ProductID ORDER BY Month( o.OrderDate ) ASC RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING ) AS average_sales FROM SalesOrderItems s KEY JOIN SalesOrders o WHERE Year( o.OrderDate ) = 2000 GROUP BY s.ProductID, Month( o.OrderDate ) ) AS DT ORDER BY 1,2; ProductID,julian_month,sales,average_sales 300,1,108,108.000000 300,2,108,112.000000 300,3,120,112.000000 300,4,108,116.000000 300,5,120,116.000000 300,6,120,132.000000 300,7,156,132.000000 300,8,120,124.000000 300,9,96,116.000000 300,10,132,124.000000 300,11,144,140.000000 300,12,144,144.000000 ...
What makes that example so complex? Besides the problem statement, that is: "compute the moving average of all product sales, by month, in the year 2000."
Here's what makes it complex:
- two references to the AVG() function,
- two SUM() calls,
- three calls to Month(),
- two ORDER BYs,
- a FROM,
- a WHERE,
- a GROUP BY (which all by itself makes just about any SELECT a head-scratcher),
- plus...
...a stealth WINDOW clause!
...a WINDOW clause that doesn't even use the WINDOW keyword, so to the uninitiated (the folks who need examples more than anyone else) it isn't obvious that a WINDOW is involved at all.Not just any WINDOW clause, mind you, but one that includes every single component you can code in a WINDOW:
- a PARTITION BY,
- an ORDER BY and
- a RANGE clause... not a simple ROWS clause but full-blown RANGE clause, one that has an intimate relationship with the ORDER BY... "I know what a row is, but what the [redacted] is a RANGE?"
(for a more complete discussion of this particular example, see Example 23 - Computing a Moving Average in Glenn Paulley's excellent OLAP white paper.)
Now, let's get back on track:
A Really Really Simple Moving Average
The following example displays 10 days worth of data together with the moving average of today's value and yesterday's:CREATE TABLE t ( entry_date DATE NOT NULL PRIMARY KEY, value INTEGER NOT NULL ); INSERT t VALUES ( '2012-02-01', 10 ); INSERT t VALUES ( '2012-02-02', 20 ); INSERT t VALUES ( '2012-02-03', 10 ); INSERT t VALUES ( '2012-02-04', 30 ); INSERT t VALUES ( '2012-02-05', 10 ); INSERT t VALUES ( '2012-02-06', 40 ); INSERT t VALUES ( '2012-02-07', 10 ); INSERT t VALUES ( '2012-02-08', 50 ); INSERT t VALUES ( '2012-02-09', 10 ); INSERT t VALUES ( '2012-02-10', 60 ); COMMIT; SELECT t.entry_date, t.value, AVG ( t.value ) OVER two_days AS two_day_average FROM t WINDOW two_days AS ( ORDER BY t.entry_date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) ORDER BY t.entry_date; entry_date,value,two_day_average '2012-02-01',10,10.000000 '2012-02-02',20,15.000000 '2012-02-03',10,15.000000 '2012-02-04',30,20.000000 '2012-02-05',10,20.000000 '2012-02-06',40,25.000000 '2012-02-07',10,25.000000 '2012-02-08',50,30.000000 '2012-02-09',10,30.000000 '2012-02-10',60,35.000000
The WINDOW clause on lines 21 through 23 defines a moving window that contains two rows: today's row (CURRENT ROW) and yesterday's row (1 PRECEDING):
- the WINDOW ORDER BY clause determines what "PRECEDING" means (the preceding row by t.entry_date) and
- the ROWS clause determines the size of the window (always two rows).
- So, for '2012-02-02' the average of 10 and 20 is 15.000000,
- for '2012-02-03' the average of 20 and 10 is 15.000000,
- for '2012-02-04' the average of 10 and 30 is 20.000000,
- ... and so on ...
- for '2012-02-10' the average of 10 and 60 is 35.000000.
Oops, what about the first row?
The '2012-02-01' row doesn't have a PRECEDING row, so what is the average over the moving window?According to Glenn Paulley's white paper "in the case of a moving window, it is assumed that rows containing Null values exist before the first row, and after the last row, in the input."
That means when the moving window has '2012-02-01' as the "CURRENT ROW", the "1 PRECEDING" row contains NULL values... and when SQL Anywhere computes an AVG() that includes a NULL value, it doesn't count the NULL at all... not in the numerator or in the denominator when computing the average. Here's proof:
CREATE TABLE x ( c INTEGER ); INSERT x VALUES ( NULL ); INSERT x VALUES ( 10 ); COMMIT; SELECT AVG ( c ) FROM x; AVG(x.c) 10.000000That's why two_day_average = 10.000000 for the first row '2012-02-01'.
1 comment:
Hello, thank you for explaining this so clearly and providing great documentation. I do have a question - what if I am comparing week-to-week numbers (moving average), but they might not be consecutive rows (current week is row 3322 and previous week is row 3019)? Thanks so much, Kim
Post a Comment