Showing posts with label Teradata Quantile. Show all posts
Showing posts with label Teradata Quantile. Show all posts

Saturday, 20 April 2013

Relationship between COUNT OVER, QUANTILE and PERCENTILE


Using RANK and COUNT window function to generate percentile:

We can use RANK to rank data. By default we provide rank 1 to the highest value as shown below:

select employeeid, salary, rank(salary) from employee;

Employeeid
Salary
Rank(Salary)
144
12345.11
1
124
12345.11
1
12
12000
3
11
11000
4
10
10000
5
9
9000
6
8
8000
7
7
7000
8
6
6000
9
5
5000
10
4
4000
11
3
3000
12
2
2000
13
1
1000
14

We can also change the way in which ranks are assigned.
In above example the highest value gets the 1st rank.

We can make the lowest value to get the rank 1 and highest value the highest rank by adding ASC in the RANK function as shown below:

select employeeid, salary,rank(salary ASC) from employee;

Employeeid
Salary
Rank(Salary ASC)
1
1000
1
2
2000
2
3
3000
3
4
4000
4
5
5000
5
6
6000
6
7
7000
7
8
8000
8
9
9000
9
10
10000
10
11
11000
11
12
12000
12
124
12345.11
13
144
12345.11
13


Now following query divides the RANK with COUNT window output to generate percentile:

select
Employeeid ,
Salary ,
COUNT(*) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS CNTGRP
,rank(salary ASC) AS RNK,
(RNK * 1.00 )/ CNTGRP from employee;

Above query gives the result as follows , which is similar to percentile

Employeeid
Salary
CNTGRP
RNK
((RNK*1.00)/CNTGRP)
1
1000
14
1
0.07
2
2000
14
2
0.14
3
3000
14
3
0.21
4
4000
14
4
0.29
5
5000
14
5
0.36
6
6000
14
6
0.43
7
7000
14
7
0.5
8
8000
14
8
0.57
9
9000
14
9
0.64
10
10000
14
10
0.71
11
11000
14
11
0.79
12
12000
14
12
0.86
124
12345.11
14
13
0.93
144
12345.11
14
13
0.93

Note the use of multiplication with 1.00 . It is done with the intension to convert the result to decimal format, other wise all the values below 1 will be shown as 0 as we are dividing integer by an integer.



COUNT window function and QUANTILE:

The count window function is closely related to quantile function .

Quantile internally makes use of COUNT window function.

The relation can be shown as below:

QUANTILE(n,COLUMN)=  (RANK(COLUMN ASC)-1) * n / COUNT window(COLUMN)

Ex: Percentile
select
Employeeid ,
Salary ,
COUNT(salary) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS CNTGRP
,rank(salary ASC) AS RNK
,QUANTILE (100,salary)
,((RNK -1 ) * 100 )/ CNTGRP from employee;

Employeeid
Salary
CNTGRP
RNK
Quantile(100,Salary)
(((RNK-1)*100)/CNTGRP)
1
1000
14
1
0
0
2
2000
14
2
7
7
3
3000
14
3
14
14
4
4000
14
4
21
21
5
5000
14
5
28
28
6
6000
14
6
35
35
7
7000
14
7
42
42
8
8000
14
8
50
50
9
9000
14
9
57
57
10
10000
14
10
64
64
11
11000
14
11
71
71
12
12000
14
12
78
78
124
12345.11
14
13
85
85
144
12345.11
14
13
85
85

Ex: Quartile
select
Employeeid ,
Salary ,
COUNT(salary) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS CNTGRP
,rank(salary ASC) AS RNK
,QUANTILE (30,salary)
,((RNK -1 ) * 30 )/ CNTGRP from employee;

Employeeid
Salary
CNTGRP
RNK
Quantile(30,Salary)
(((RNK-1)*30)/CNTGRP)
1
1000
14
1
0
0
2
2000
14
2
2
2
3
3000
14
3
4
4
4
4000
14
4
6
6
5
5000
14
5
8
8
6
6000
14
6
10
10
7
7000
14
7
12
12
8
8000
14
8
15
15
9
9000
14
9
17
17
10
10000
14
10
19
19
11
11000
14
11
21
21
12
12000
14
12
23
23
124
12345.11
14
13
25
25
144
12345.11
14
13
25
25

Friday, 29 March 2013

OLAP - Part 8 - Quantiles


Using Quantiles:

  1. Quantiles are used to divide a number of rows into a number of partitions of roughly equal size.

  1. The most common quantile is the percentile which is based on a value of 100.We can also have quartiles, deciles etc.

  1. Syntax:

QUANTILE (quantile_constant,sortlist)

quantile_constant = constant used to define number of quantile partitions.

sortlist = column reference to be used for the partitioning and to sort the output. 

  1. Example:

select employeeid, salary, quantile(10,salary) from employee;

Employeeid
Salary
Quantile(10,Salary)
1
1000
0
2
2000
0
3
3000
1
4
4000
2
5
5000
3
6
6000
4
7
7000
5
8
8000
5
9
9000
6
10
10000
7
11
11000
8
12
12000
9

Note that by default the quantile and quantile column will be output in ascending sequence.


The ORDER BY clause can be used to override the ascending sort default.

QUANTILE(10, salary DESC)

The DESC specification is the default for the QUANTILE function. This might appear to be a contradiction to what shown above but in reality it is not.

In looking at the report generated using QUANTILE (10, salary DESC), the output appears to be sorted by salary ascending. The DESC qualifier on the column indicates only that the column ‘salary’ will descend as the quantile value itself descends, regardless of the requested ordering (i.e. ORDER BY) of the output.

select employeeid, salary, quantile(10,salary desc) from employee;

Employeeid
Salary
Quantile(10,Salary)
1
1000
0
2
2000
0
3
3000
1
4
4000
2
5
5000
3
6
6000
4
7
7000
5
8
8000
5
9
9000
6
10
10000
7
11
11000
8
12
12000
9



In other words, the lowest ‘salary’ value  will correspond to the lowest quantile value (0) and conversely the highest ‘salary’  will correspond to the highest quantile value.
In short, adding DESC is unnecessary as it is the default. 


  1. Using ASC is the non-default option. In this case the column ‘salary’ will ascend as the quantile value descends. In other words, the highest will correspond to the lowest quantile value.

With quantiles, an ascending quantile means that the lowest value gets the highest quantile.

select employeeid, salary, quantile(10,salary ASC) from employee;

Employeeid
Salary
Quantile(10,Salary ASC)
12
12000
0
11
11000
0
10
10000
1
9
9000
2
8
8000
3
7
7000
4
6
6000
5
5
5000
5
4
4000
6
3
3000
7
2
2000
8
1
1000
9

  1. Quantile is a OLAP function and hence cannot be combined with normal aggregate functions.

    Hence a query like below is invalid:

SELECT SUM(salary) from employee qualify quantile(10,salary) > 75;

  1. Quantiles can be used to answer questions like “Show all employee salaries that are in the top 25 % of the company.”

We can have multiple columns in the sortlist.
Example:

                  QUANTILE (100, salary_amount, employee_number)
      
The purpose of the ‘employee_number’ here is to handle a situation where there is a tie on the salary amount. Remember, the ‘salary_amount’ parameter (with a default of DESC) tells us that the salary_amount will descend as the quantile value descends.
     
        The default sort for quantiles is the column descends as the quantile descends. Thus, salary   amount descends as quantile descends. If there is a tie in the salary, then employee number descends as the quantile descends.