- OLAP stands for On-Line Analytical Processing.
- Following are important OLAP functions:
· RANK
- (Rankings)
· QUANTILE
- (Quantiles)
· CSUM
- (Cumulation)
· MAVG
- (Moving Averages)
· MSUM
- (Moving Sums)
· MDIFF
- (Moving Differences)
· MLINREG
- (Moving Linear Regression)
- OLAP functions are similar to aggregate functions in the sense that they operate on groups of rows.However they are different from aggregate functions in the sense that they show detail rows as well. This is not possible with aggregate functions.
===========================================
Cumulative Sum:
1.
Cumulative sum (CSUM) computes a
running or cumulative total of a column’s value.
2. Syntax: CSUM(columnname,sortlist)
Here the sortlist is used to sort the
data by some column before cumulative sum is performed. Default sequence is ascending.
Example:
SELECT CSUM(salary) from table1 gives an error.
NOte:
SELECT CSUM(salary,salary) from table1 is also a valid query.
Data is first sorted by Salary column and then used to perform CSUM.
Example:
SELECT EMPLOYEEID , CSUM(SALARY,EMPLOYEEID) FROM EDW_RESTORE_TABLES.TESTY;
employeeid CSum(Salary,employeeid)
1 10000
2 20000
2 40000
4 60000
5 90000
6 100000
7 110000
8 120000
It is very important to note that CSUM needs a sorting sequence which is first used to sort the data in table and then use it to perform cumulative sum on the column mentioned.1 10000
2 20000
2 40000
4 60000
5 90000
6 100000
7 110000
8 120000
SELECT CSUM(salary) from table1 gives an error.
NOte:
SELECT CSUM(salary,salary) from table1 is also a valid query.
Data is first sorted by Salary column and then used to perform CSUM.
3.
Example:
SELECT salesdate, sales, CSUM(sales,
salesdate)FROM daily_sales;
salesdate
sales
Csum
98/01/01
150.00 150.00
98/01/02
200.00 350.00
98/01/03
250.00 600.00
98/01/05
350.00 950.00
98/01/10
550.00 1500.00
98/01/21
150.00 1650.00
98/01/25
200.00 1850.00
98/01/31
100.00 1950.00
98/02/01
150.00 2100.00
98/02/03
250.00 2350.00
98/02/06
350.00 2700.00
98/02/17
550.00 3250.00
98/02/20
450.00 3700.00
98/02/27
350.00 4050.00
In above example Data is sorted by
Salesdate and column sales is used for performing cumulative sum.
===========================================
Cumulative Sums With Reset:
1.
A cumulative sum may be reset to zero
at specified breakpoints.
2.
This is done with the help of
’GROUP BY’ statement. Group BY indicates that when value of column
specified in ‘group by’ changes the cumulative sum should be reset.
3.
Also its very important to note that
OLAP and standard aggregations (SUM, COUNT,AVG, MIN,MAX) are not compatible
within the same query. Since OLAP and aggregate functions cant be used together
, GROUP BY serves a separate purpose for each of the types of query.
4.
Example:
SELECT EMPLOYEEID , DEPARTMENTNO,CSUM(SALARY,EMPLOYEEID) FROM EDW_RESTORE_TABLES.TESTY GROUP BY DEPARTMENTNO;
employeeid departmentno CSum(Salary,employeeid)
1 100 10000
2 100 20000
2 200 20000 --> accumulation resets
4 200 40000
5 300 30000--> accumulation resets
6 300 40000
7 400 10000--> accumulation resets
8 500 10000--> accumulation resets
1 100 10000
2 100 20000
2 200 20000 --> accumulation resets
4 200 40000
5 300 30000--> accumulation resets
6 300 40000
7 400 10000--> accumulation resets
8 500 10000--> accumulation resets
NOte that we are using GROUP BY. Here the function of group by is different from what it does with aggregate functions. Here group by is used to provide partitioning logic.
===========================================
Moving Average (MAVG):
1. Used
to calculate Moving average on a column.
2. The
number of rows used for aggregation operation is called as query width.
3. Syntax:
MAVG(colname, n, sortlist)
colname =
the column on which the moving average is computed
n = the number of rows (< 4096) which will be used in the calculation including the current row.
n = the number of rows (< 4096) which will be used in the calculation including the current row.
('n' is also refered to as the
'width' of the average)
sortlist = the column(s) which determine the sequencing of the rows. Default is ascending.
sortlist = the column(s) which determine the sequencing of the rows. Default is ascending.
4. This
function computes the moving AVG of a column based on some number of preceding
rows.
5. If
the number of rows preceding the current row is less than the width, the
average is computed based on the existing preceding rows.
6. Example:
SELECT EMPLOYEEID,SALARY,MAVG(SALARY,2,EMPLOYEEID) FROM EDW_RESTORE_TABLES.TESTY
employeeid Salary MAvg(Salary,2,employeeid)
1 10000 10000.00
2 10000 10000.00 avg of current row and prev row
2 20000 15000.00
4 20000 20000.00
5 30000 25000.00
6 10000 20000.00
7 10000 10000.00
8 10000 10000.00
For computation purpose current row and the preceding n-1 rows
are used.
If the number of rows is less than
n-1, it uses all preceding rows.
It sorts in ascending order by sortlist column(s) as the default.
===========================================
Moving Sum (MSUM):
1. Used
to calculate Moving sum on a column.
2. The
number of rows used for aggregation operation is called as query width.
3. Syntax:
MSUM(colname, n, sortlist)
colname =
the column on which the moving sum is computed
n = the number of rows (< 4096) which will be used in the calculation including the current row.
n = the number of rows (< 4096) which will be used in the calculation including the current row.
('n' is also refered to as the
'width' of the average)
sortlist = the column(s) which determine the sequencing of the rows. Default is ascending.
sortlist = the column(s) which determine the sequencing of the rows. Default is ascending.
4. Example:
SELECT salesdate, itemid, sales,
MSUM(sales, 3, salesdate)
WHERE itemid = 10
WHERE itemid = 10
FROM daily_sales;
Result
salesdate
itemid
sales
MSum
98/01/01
10 150.00
150.00
98/01/02
10 200.00
350.00 Sum of 2 rows
98/01/03
10 250.00
600.00 Sum of 3 rows starting row 1
98/01/05
10
350.00
800.00 Sum of 3 rows starting row 2
98/01/10
10 550.00
1150.00 Sum of 3 rows starting row 3
98/01/21
10 150.00
1050.00
98/01/25
10
200.00
900.00 Sum of 3 rows
98/01/31
10
100.00
450.00
98/02/01
10
150.00
450.00
98/02/03
10
250.00
500.00
98/02/06
10
350.00
750.00
98/02/17
10 550.00
1150.00
98/02/20
10 450.00
1350.00
98/02/27
10 350.00
1350.00
5. Moving
Sum (MSum) follows the same rules as Moving Averages (MAvg):
· Uses current row and preceding n-1 rows.
· Uses all preceding rows if less than n-1.
· Sort ascending by sortlist column(s) is the
default.
=======================================
Moving
Differences (MDIFF):
1. The
Moving Difference (MDIFF) function permits a calculation of a moving difference
of a specified column, based on a defined query width (n).
2. The
width determines how many rows back to count for the subtrahend (i.e. the
number being subtracted).
3. If
there are less than n preceding rows, a null will be generated to represent the
difference.
4. Syntax:
MDIFF(colname,
n, sortlist)
colname = the column on which the moving sum is computed
n = the number of rows (< 4096)
colname = the column on which the moving sum is computed
n = the number of rows (< 4096)
sortlist =
the column(s) which determine the sequencing of the rows. Default is ascending.
5. Example:
SELECT
salesdate, itemid, sales, MDIFF(sales, 3, salesdate)
FROM
daily_sales;
Result
salesdate
itemid sales MDiff
98/01/01
10 150.00 ? Null
because there is no value 3 rows above
98/01/02
10
200.00 ? Null because there is no value 3
rows above
98/01/03
10 250.00 ? Null
because there is no value 3 rows above
98/01/05
10 350.00 200.00
Difference this row and row 1
98/01/10
10 550.00 350.00
Difference this row and row 2
98/01/21
10 150.00 -100.00
98/01/25
10 200.00 -150.00
98/01/31
10
100.00 -450.00
98/02/01
10 150.00
.OO
98/02/03
10 250.00
50.00 Difference of 2 rows
98/02/06
10 350.00 250.00
98/02/17
10 550.00 400.00
98/02/20
10 450.00 200.00
98/02/27
10 350.00
.00
6. The
usage of MDIFF is slightly different than MAvg and MSum in that it:
· Uses current row and preceding nth row
· Value is null if there is no preceding nth row
· Sort ascending by sortlist column(s) is default
========================================
Rank
function:
1. The
rank function allows a column to be ranked either based on high or low order,
against other rows in the answer set.
2. By
default, the output will be sorted in descending sequence of the ranking
column. This in short means that highest value in the ranked column gives rank
1.
3. Syntax:
RANK(columnname).
where
colname represents the column to be ranked and the descending sort key of the
result.
4. Example:
SELECT
storeid, prodid, sales, RANK(sales)
FROM
salestbl
WHERE
storeid = 1001;
storeid
prodid ____sales Rank
1001
F 150000.00 1
1001
A 100000.00 2
1001
C 60000.00 3
1001
D 35000.00 4
5. Points
to note are:
· When Ranking is applied -
default highest amount is low rank #.
· The default sort sequence is descending
by ranking column (sales).
=======================================================
RANK with ‘Qualify’ and ‘GROUP BY’:
1. The
QUALIFY clause allows restriction of which rankings will be output in the final
result.
2. QUALIFY
performs like the HAVING clause by requesting a specific range in the output.
3. Example:
SELECT storeid, prodid, sales,
RANK(sales)
FROM salestbl
GROUP BY storeid
QUALIFY rank(sales) <= 3;
Result
storeid
prodid
sales Rank
1001
F 150000.00 1
1001 A
100000.00 2
1001 C
60000.00 3
1002 A
40000.00 1
1002 C
35000.00 2
1002 D
25000.00 3
1003 B
65000.00 1
1003 D
50000.00 2
1003 A
30000.00 3
4. Note
that here one more feature is being used . ie GROUP BY.
The GROUP BY isn't doing an
aggregation. It is actually changing the scope of the query. It also causes an
ascending sort.
5. The
default sort sequence is descending by ranking column (sales).
Due to GROUP BY, the sort is by sales
ascending within store. With in a given store the sort is descending by the
Ranked column .ie sales.
6. Qualify
rank(sales) <= 3 means it only outputs rows that have rank less than or
equal to 3. This means top selling products for each store.
========================================
More
on ‘GROUP BY’ clause:
1. The
GROUP BY clause may be used in conjunction with a RANK function to change the
scope of the ranking.
2. Without a GROUP BY clause, the scope defaults to
the RANK column.
IN
the the below example the scope is the ranking column itself.
Example:
SELECT
storeid
,prodid
,sales
,RANK(sales)
FROM
salestbl
QUALIFY
RANK(sales) <= 3
;
Result
storeid
prodid
sales Rank
1001
F 150000.00 1
1001
A 100000.00 2
1003
B 65000.00 3
3. ‘GROUP
BY” adds a major sorting column as well.
Ascending
Sort is performed first on the column in GROUP BY and then sorted descending on
the ranked column.
Example:
SELECT
storeid
,prodid
,sales
,RANK(sales)
FROM
salestbl
GROUP
BY storeid, prodid
QUALIFY
RANK(sales) <= 3
;
Result
storeid
prodid
sales Rank
1001
A 100000.00 1
1001
C 60000.00 1
1001
D 35000.00 1
1001
F 150000.00 1
1002
A 40000.00 1
1002
C 35000.00 1
1002
D 25000.00 1
1003
A 30000.00 1
1003
B 65000.00 1
1003
C 20000.00 1
1003
D 50000.00 1
Because
these three elements make up a ranking group and because the distinct elements
store and product only yield a single row, there are no rankings below rank
one. This is a case where the scope matches the granularity of the table, thus
no reasonable ranking can be produced.
================================
Low
order Ranking and ORDER BY clause:
· Sometimes we may need to use low order ranking .For
example: To sell top poorest selling products etc.
· This is accomplished by changing the default order
of the rank column from descending to ascending.
For
this we use RANK(column ASC).
· Example:
SELECT
t.prodid, t.sumsales, rank(t.sumsales)
FROM
(SELECT a.prodid, SUM(a.sales) FROM salestbl a
GROUP BY 1) AS t(prodid, sumsales)
QUALIFY
RANK(sumsales ASC) <= 3;
· We can also use qualify when using rank.
The
QUALIFY clause <= 3 means produce a list of the ‘first three’ whatever the
sequencing of the list.
So
if we are using rank with ascending sort on the column then it produces lowest
3 records.
· Simpler example is as below:
SELECT SALARY,RANK(salary) FROM EDW_RESTORE_TABLES.TESTX
salary Rank(salary)
50000 1
40000 2
30000 3
20000 4
10000 5
SELECT SALARY,RANK(salary ASC) FROM EDW_RESTORE_TABLES.TESTX
salary Rank(salary ASC)
10000 1
20000 2
30000 3
40000 4
50000 5
This
it is clear that with ASC the lowest gets the rank 1.
This
is any case first row will always have a rank 1.This can be changed by using
ORDER BY clause.
When
order by clause is used, the ORDER BY becomes the major sort in the results
list.
SELECT SALARY,RANK(salary ASC) FROM EDW_RESTORE_TABLES.TESTX order by salary desc;
salary Rank(salary ASC)
50000 5
40000 4
30000 3
20000 2
10000 1
-====================================================
-====================================================
Using Quantiles:
·
Quantiles are used to divide a number
of rows into a number of partitions of roughly equal size.
·
The most common quantile is the
percentile which is based on a value of 100.We can also have quartiles,deciles
etc.
·
Syntax:
QUANTILE
(quantile_constant,sortlist)
quantile_constant = constant used to define number
of quantile partitions.
sortlist = column reference to be used for the
partitioning and to sort the output.
·
By default the ordering is desc. .ie
QUANTILE(quantile_constant,sortlist desc) is same as QUANTILE
(quantile_constant,sortlist).
The
DESC specification is the default for the QUANTILE function.
·
Example:
QUANTILE(3,
sumsales DESC)
The
DESC qualifier on the column indicates only that the column ‘sumsales’ will
descend as the quantile value itself descends, regardless of the requested ordering
(i.e. ORDER BY) of the output. .ie Lowest will get the lowest quantile.
In
short, adding DESC is unnecessary as it is the default.
As
always, the ORDER BY clause may be used to reorder the output.
Example:
SELECT
salary,QUANTILE(100,salary DESC) FROM EDW_RESTORE_TABLES.TESTX
salary Quantile(100,salary)
10000 0
20000 20
30000 40
40000 60
50000 80
·
Using ASC is the non-default option.
In this case the column ‘sumsales’ will ascend as the quantile value descends.
In other words, the highest will correspond to the lowest quantile value.
With
quantiles, an ascending quantile means that the lowest value gets the highest
quantile.
SELECT salary,QUANTILE(100,salary ASC) FROM EDW_RESTORE_TABLES.TESTX
salary Quantile(100,salary ASC)
50000 0
40000 20
30000 40
20000 60
10000 80
Quantiles can be used to answer questions like “Show
all employee salaries that are in the top 25 % of the company.”
·
We can have multiple columns in the
sortlist.
Example:
QUANTILE
(100, salary_amount, employee_number)
The
purpose of the ‘employee_number’ here is to handle a situation where there is a
tie on the salary amount. Remember, the ‘salary_amount’ parameter (with a
default of DESC) tells us that the salary_amount will descend as the quantile
value descends.
The
default sort for quantiles is the column descends as the quantile descends.
Thus, salary amount descends as quantile descends. If there is a tie in the
salary, then employee number descends as the quantile descends.
Using Samples:
·
The SAMPLE function is used to
generate samples of data from a table or view.
·
This can be done in two ways:
1. Sample n where n is integer. It will return n
number of rows.(Actual number of rows)
2. Sample n where n is a fraction. That will return
that much fraction of the table.(.ie Percentage of the table)
·
If the number n is greater than the
number of rows in the table, the sample will consist of the number of rows in
the table.
·
Rows are not reused within the same
sample.
·
Note that calculations resulting in
fractional rows must be greater than .5 to actually return a row.
In
the following example, 25% of the rows of the employee table are to be
returned. The employee table has 26 rows.
26
* .25 = 6.50 = 6 rows in the sample
·
Examples:
SELECT COUNT(*) FROM EDW_RESTORE_TABLES.TESTX
Count(*)
5
Here the table has 5 rows.
SELECT * FROM
EDW_RESTORE_TABLES.TESTX SAMPLE 3
salary
20000
30000
50000
SELECT * FROM EDW_RESTORE_TABLES.TESTX SAMPLE .5
salary
20000
10000
30000
·
Multiple sample sets may be generated
in a single query if desired. In order to identify the specific set a tag
called the SAMPLEID is made available for association with each set. The SAMPLEID
may be selected, used for ordering, or used as a column in a new table.
·
When using multiple sets rows are not
repeated to different sample sets.
·
SAMPLEID is a keyword that assigns a
sample-id to a specific sample set.
·
Example:
SELECT SALARY,SAMPLEID FROM EDW_RESTORE_TABLES.TESTX
SAMPLE .5,0.25
salary SampleId
20000 1
10000 2
30000 1
50000 1
In
above case if the sum of the sample fractions increases beyond 1 then it gives
an error.
.ie
0.5,0.5,0.25 gives error as it gives beyond 1.
·
Example:
SELECT department_number,SAMPLEID FROM department SAMPLE 3, 5, 8 ORDER
BY SAMPLEID;
Result
*** Warning: 7473 Requested sample is larger than table rows. All rows
returned.***
department_number SampleId
----------------- -----------
302 1
201 1
501 1
402 2
301 2
401 2
403 2
600 2
100 3
In
the example, because the rows are not repeated to different sample sets, the supply
of rows is exhausted before the third set can be completed. This results in a
warning that there were not enough rows to populate all samples as requested.