Wednesday 1 February 2012

Explain OLAP and Important OLAP functions?

  1. OLAP stands for On-Line Analytical Processing.
  2.  Following are important OLAP functions:
·         RANK - (Rankings) 
·         QUANTILE - (Quantiles) 
·         CSUM - (Cumulation) 
·         MAVG - (Moving Averages) 
·         MSUM - (Moving Sums) 
·         MDIFF - (Moving Differences) 
·         MLINREG - (Moving Linear Regression)  
  1. OLAP functions are similar to aggregate functions in the sense that they operate on groups of rows.However they are different from aggregate functions in the sense that they show detail rows as well. This is not possible with aggregate functions.
=========================================== 
Cumulative Sum:

1.   Cumulative sum (CSUM) computes a running or cumulative total of a column’s value.

2. Syntax: CSUM(columnname,sortlist)

Here the sortlist is used to sort the data by some column before cumulative sum is performed. Default sequence is ascending.

Example:
SELECT EMPLOYEEID , CSUM(SALARY,EMPLOYEEID) FROM EDW_RESTORE_TABLES.TESTY;

    employeeid    CSum(Salary,employeeid)
    1                  10000
    2                  20000
    2                  40000
    4                  60000
    5                  90000
    6                  100000
    7                  110000
    8                  120000
It is very important to note that CSUM needs a sorting sequence which is first used to sort the data in table and then use it to perform cumulative sum on the column mentioned.

SELECT CSUM(salary) from table1 gives an error.

NOte:
SELECT CSUM(salary,salary) from table1 is also a valid query.
Data is first sorted by Salary column and then used to perform CSUM.
 
3.   Example:
SELECT salesdate, sales, CSUM(sales, salesdate)FROM daily_sales;
salesdate           sales              Csum
98/01/01        150.00         150.00
98/01/02        200.00         350.00
98/01/03        250.00         600.00
98/01/05        350.00         950.00
98/01/10        550.00        1500.00
98/01/21        150.00        1650.00
98/01/25        200.00        1850.00
98/01/31        100.00        1950.00
98/02/01        150.00        2100.00
98/02/03        250.00        2350.00
98/02/06        350.00        2700.00
98/02/17        550.00        3250.00
98/02/20        450.00        3700.00
98/02/27        350.00        4050.00
In above example Data is sorted by Salesdate and column sales is used for performing cumulative sum.
 
 ===========================================


Cumulative Sums With Reset:

1.   A cumulative sum may be reset to zero at specified breakpoints.

2.   This is done with the help of ’GROUP BY’ statement. Group BY indicates that when value of column specified in ‘group by’ changes the cumulative sum should be reset.

3.   Also its very important to note that OLAP and standard aggregations (SUM, COUNT,AVG, MIN,MAX) are not compatible within the same query. Since OLAP and aggregate functions cant be used together , GROUP BY serves a separate purpose for each of the types of query.

4.   Example:
SELECT EMPLOYEEID , DEPARTMENTNO,CSUM(SALARY,EMPLOYEEID) FROM EDW_RESTORE_TABLES.TESTY GROUP BY DEPARTMENTNO;
    employeeid    departmentno    CSum(Salary,employeeid)
    1                    100                     10000
    2                    100                     20000
    2                    200                     20000 --> accumulation resets
    4                    200                     40000
    5                    300                     30000
--> accumulation resets
    6                    300                     40000
    7                    400                     10000
--> accumulation resets
    8                    500                     10000--> accumulation resets
 
NOte that we are using GROUP BY. Here the function of group by is different from what it does with aggregate functions. Here group by is used to provide partitioning logic.
 
 ===========================================

Moving Average (MAVG):

1.   Used to calculate Moving average on a column.

2.   The number of rows used for aggregation operation is called as query width.

3.   Syntax:
MAVG(colname, n, sortlist)
colname = the column on which the moving average is computed
n = the number of rows (< 4096) which will be used in the calculation including the current row.
 ('n' is also refered to as the 'width' of the average)
sortlist = the column(s) which determine the sequencing of the rows. Default is ascending.

4.   This function computes the moving AVG of a column based on some number of preceding rows.

5.   If the number of rows preceding the current row is less than the width, the average is computed based on the existing preceding rows.

6.   Example:
SELECT EMPLOYEEID,SALARY,MAVG(SALARY,2,EMPLOYEEID) FROM EDW_RESTORE_TABLES.TESTY 
            employeeid      Salary  MAvg(Salary,2,employeeid)
            1                      10000  10000.00
            2                      10000  10000.00 avg of current row and prev row
            2                      20000  15000.00
            4                      20000  20000.00
            5                      30000  25000.00
            6                      10000  20000.00
            7                      10000  10000.00
            8                      10000  10000.00

 
For computation purpose current row and the preceding n-1 rows are used.
If the number of rows is less than n-1, it uses all preceding rows.
It sorts in ascending order by sortlist column(s) as the default.

  ===========================================

Moving Sum (MSUM):

1.   Used to calculate Moving sum on a column.

2.   The number of rows used for aggregation operation is called as query width.

3.   Syntax:
MSUM(colname, n, sortlist)
colname = the column on which the moving sum is computed
n = the number of rows (< 4096) which will be used in the calculation including the current row.
 ('n' is also refered to as the 'width' of the average)
sortlist = the column(s) which determine the sequencing of the rows. Default is ascending.

4.   Example:
SELECT salesdate, itemid, sales, MSUM(sales, 3, salesdate)
WHERE itemid = 10
FROM daily_sales;
Result
salesdate   itemid          sales               MSum
98/01/01         10      150.00        150.00
98/01/02    10    200.00      350.00       Sum of 2 rows
98/01/03    10    250.00              600.00 Sum of 3 rows starting row 1
98/01/05    10    350.00              800.00 Sum of 3 rows starting row 2
98/01/10     10    550.00             1150.00 Sum of 3 rows starting row 3
98/01/21    10    150.00             1050.00
98/01/25    10    200.00              900.00       Sum of 3 rows
98/01/31    10    100.00              450.00
98/02/01    10    150.00              450.00
98/02/03    10    250.00              500.00
98/02/06    10    350.00              750.00
98/02/17     10    550.00             1150.00
98/02/20     10    450.00             1350.00
98/02/27    10    350.00             1350.00

5.   Moving Sum (MSum) follows the same rules as Moving Averages (MAvg):
·         Uses current row and preceding n-1 rows.
·         Uses all preceding rows if less than n-1.
·         Sort ascending by sortlist column(s) is the default.
=======================================

Moving Differences (MDIFF):

1.   The Moving Difference (MDIFF) function permits a calculation of a moving difference of a specified column, based on a defined query width (n).

2.   The width determines how many rows back to count for the subtrahend (i.e. the number being subtracted).

3.   If there are less than n preceding rows, a null will be generated to represent the difference.

4.   Syntax:
MDIFF(colname, n, sortlist)
colname = the column on which the moving sum is computed
n = the number of rows (< 4096)
sortlist = the column(s) which determine the sequencing of the rows. Default is ascending.

5.   Example:
SELECT salesdate, itemid, sales, MDIFF(sales, 3, salesdate)
FROM daily_sales;
Result
salesdate   itemid        sales MDiff
98/01/01    10     150.00       ? Null because there is no value 3 rows above
98/01/02    10     200.00       ? Null because there is no value 3 rows above
98/01/03    10     250.00       ? Null because there is no value 3 rows above
98/01/05    10     350.00       200.00 Difference this row and row 1
98/01/10    10     550.00       350.00 Difference this row and row 2
98/01/21    10     150.00       -100.00
98/01/25    10     200.00       -150.00
98/01/31    10     100.00       -450.00
98/02/01    10     150.00           .OO
98/02/03    10     250.00         50.00      Difference of 2 rows
98/02/06    10     350.00       250.00
98/02/17    10     550.00       400.00
98/02/20    10     450.00       200.00
98/02/27    10     350.00           .00

6.   The usage of MDIFF is slightly different than MAvg and MSum in that it:
·         Uses current row and preceding nth row
·         Value is null if there is no preceding nth row
·         Sort ascending by sortlist column(s) is default
 ========================================

Rank function:
1.   The rank function allows a column to be ranked either based on high or low order, against other rows in the answer set.
2.   By default, the output will be sorted in descending sequence of the ranking column. This in short means that highest value in the ranked column gives rank 1.
3.   Syntax:
RANK(columnname).
where colname represents the column to be ranked and the descending sort key of the result.
4.   Example:
SELECT storeid, prodid, sales, RANK(sales)
FROM salestbl
WHERE storeid = 1001;
storeid       prodid        ____sales   Rank
   1001      F       150000.00  1
   1001      A      100000.00  2
   1001      C      60000.00  3
   1001      D      35000.00  4
5.   Points to note are:
·  When Ranking is applied - default highest amount is low rank #.
·  The default sort sequence is descending by ranking column (sales).
 
=======================================================

RANK with ‘Qualify’ and ‘GROUP BY’:

1.   The QUALIFY clause allows restriction of which rankings will be output in the final result.
2.   QUALIFY performs like the HAVING clause by requesting a specific range in the output.
3.   Example:
SELECT storeid, prodid, sales, RANK(sales)
FROM salestbl
GROUP BY storeid
QUALIFY rank(sales) <= 3;
Result
storeid       prodid        sales        Rank
1001    F       150000.00      1
1001 A      100000.00  2
1001 C      60000.00  3
1002 A      40000.00  1
1002 C      35000.00  2
1002 D      25000.00  3
1003 B      65000.00  1
1003 D      50000.00  2
1003 A      30000.00  3
4.   Note that here one more feature is being used . ie GROUP BY.
The GROUP BY isn't doing an aggregation. It is actually changing the scope of the query. It also causes an ascending sort.
5.   The default sort sequence is descending by ranking column (sales).
Due to GROUP BY, the sort is by sales ascending within store. With in a given store the sort is descending by the Ranked column .ie sales.
6.   Qualify rank(sales) <= 3 means it only outputs rows that have rank less than or equal to 3. This means top selling products for each store.
========================================

More on ‘GROUP BY’ clause:
1.   The GROUP BY clause may be used in conjunction with a RANK function to change the scope of the ranking. 
    2.   Without a GROUP BY clause, the scope defaults to the RANK column.
IN the the below example the scope is the ranking column itself.
Example:
SELECT storeid
    ,prodid
    ,sales
    ,RANK(sales)
FROM salestbl
QUALIFY RANK(sales) <= 3
;
Result
storeid       prodid        sales         Rank
1001 F       150000.00  1
1001 A      100000.00  2
1003 B      65000.00  3
3.   ‘GROUP BY” adds a major sorting column as well.
Ascending Sort is performed first on the column in GROUP BY and then sorted descending on the ranked column.
Example:
SELECT storeid
     ,prodid
     ,sales
     ,RANK(sales)
FROM salestbl
GROUP BY storeid, prodid
QUALIFY RANK(sales) <= 3
;
Result
storeid       prodid        sales         Rank
1001 A      100000.00  1
1001 C      60000.00  1
1001 D      35000.00  1
1001 F       150000.00  1
1002 A      40000.00  1
1002 C      35000.00  1
1002 D      25000.00  1
1003 A      30000.00  1
1003 B      65000.00  1
1003 C      20000.00  1
1003 D      50000.00  1
Because these three elements make up a ranking group and because the distinct elements store and product only yield a single row, there are no rankings below rank one. This is a case where the scope matches the granularity of the table, thus no reasonable ranking can be produced. 
 ================================
Low order Ranking and ORDER BY clause:

·         Sometimes we may need to use low order ranking .For example: To sell top poorest selling products etc.
·         This is accomplished by changing the default order of the rank column from descending to ascending.
For this we use RANK(column ASC).
·         Example:

SELECT  t.prodid, t.sumsales,  rank(t.sumsales)
FROM (SELECT a.prodid, SUM(a.sales) FROM salestbl  a 
        GROUP BY 1) AS t(prodid, sumsales)
QUALIFY RANK(sumsales ASC) <= 3;

·         We can also use qualify when using rank.
The QUALIFY clause <= 3 means produce a list of the ‘first three’ whatever the sequencing of the list.
So if we are using rank with ascending sort on the column then it produces lowest 3 records.

·         Simpler example is as below:
SELECT SALARY,RANK(salary) FROM EDW_RESTORE_TABLES.TESTX
        salary        Rank(salary)
        50000        1
        40000        2
        30000        3
        20000        4
        10000        5
SELECT SALARY,RANK(salary ASC) FROM EDW_RESTORE_TABLES.TESTX
        salary        Rank(salary ASC)
        10000        1
        20000        2
        30000        3
        40000        4
        50000        5
This it is clear that with ASC the lowest gets the rank 1.
This is any case first row will always have a rank 1.This can be changed by using ORDER BY clause.

When order by clause is used, the ORDER BY becomes the major sort in the results list.

SELECT SALARY,RANK(salary ASC) FROM EDW_RESTORE_TABLES.TESTX order by salary desc;
        salary        Rank(salary ASC)
        50000        5
        40000        4
        30000        3
        20000        2
        10000        1




-====================================================

Using Quantiles:
·         Quantiles are used to divide a number of rows into a number of partitions of roughly equal size.
·         The most common quantile is the percentile which is based on a value of 100.We can also have quartiles,deciles etc.
·         Syntax:
                  QUANTILE (quantile_constant,sortlist)
quantile_constant = constant used to define number of quantile partitions.
sortlist = column reference to be used for the partitioning and to sort the output. 

·         By default the ordering is desc. .ie QUANTILE(quantile_constant,sortlist desc) is same as QUANTILE (quantile_constant,sortlist).
The DESC specification is the default for the QUANTILE function.
·         Example:
QUANTILE(3, sumsales DESC)

      The DESC qualifier on the column indicates only that the column ‘sumsales’ will descend as the quantile value itself descends, regardless of the requested ordering (i.e. ORDER BY) of the output. .ie Lowest will get the lowest quantile.
In short, adding DESC is unnecessary as it is the default.
As always, the ORDER BY clause may be used to reorder the output.
      Example:
      SELECT salary,QUANTILE(100,salary DESC) FROM EDW_RESTORE_TABLES.TESTX

      salary   Quantile(100,salary)

      10000  0

      20000  20

      30000  40

      40000  60

      50000  80


·         Using ASC is the non-default option. In this case the column ‘sumsales’ will ascend as the quantile value descends. In other words, the highest will correspond to the lowest quantile value.
With quantiles, an ascending quantile means that the lowest value gets the highest quantile.

SELECT salary,QUANTILE(100,salary ASC) FROM EDW_RESTORE_TABLES.TESTX



      salary   Quantile(100,salary ASC)

      50000  0

      40000  20

      30000  40

      20000  60

      10000  80


Quantiles can be used to answer questions like “Show all employee salaries that are in the top 25 % of the company.”

·         We can have multiple columns in the sortlist.
Example:

                  QUANTILE (100, salary_amount, employee_number)
      
      The purpose of the ‘employee_number’ here is to handle a situation where there is a tie on the salary amount. Remember, the ‘salary_amount’ parameter (with a default of DESC) tells us that the salary_amount will descend as the quantile value descends.
      The default sort for quantiles is the column descends as the quantile descends. Thus, salary amount descends as quantile descends. If there is a tie in the salary, then employee number descends as the quantile descends.


Using Samples:
·         The SAMPLE function is used to generate samples of data from a table or view.
·         This can be done in two ways:
1.      Sample n where n is integer. It will return n number of rows.(Actual number of rows)
2.      Sample n where n is a fraction. That will return that much fraction of the table.(.ie Percentage of the table)
·         If the number n is greater than the number of rows in the table, the sample will consist of the number of rows in the table.
·         Rows are not reused within the same sample.
·         Note that calculations resulting in fractional rows must be greater than .5 to actually return a row.
In the following example, 25% of the rows of the employee table are to be returned. The employee table has 26 rows.
            26 * .25 = 6.50 = 6 rows in the sample
·         Examples:
SELECT COUNT(*) FROM EDW_RESTORE_TABLES.TESTX
            Count(*)
            5
Here the table has 5 rows.
SELECT *  FROM EDW_RESTORE_TABLES.TESTX SAMPLE 3
            salary
            20000
            30000
            50000
SELECT *  FROM EDW_RESTORE_TABLES.TESTX SAMPLE .5
            salary
            20000
            10000
            30000
·         Multiple sample sets may be generated in a single query if desired. In order to identify the specific set a tag called the SAMPLEID is made available for association with each set. The SAMPLEID may be selected, used for ordering, or used as a column in a new table.
·         When using multiple sets rows are not repeated to different sample sets.
·         SAMPLEID is a keyword that assigns a sample-id to a specific sample set.
·         Example:
SELECT SALARY,SAMPLEID  FROM EDW_RESTORE_TABLES.TESTX SAMPLE .5,0.25
            salary  SampleId
            20000  1
            10000  2
            30000  1
            50000  1
In above case if the sum of the sample fractions increases beyond 1 then it gives an error.
.ie 0.5,0.5,0.25 gives error as it gives beyond 1.
·         Example:
SELECT department_number,SAMPLEID FROM department SAMPLE 3, 5, 8 ORDER BY SAMPLEID;
Result
*** Warning: 7473 Requested sample is larger than table rows. All rows returned.***
department_number     SampleId
-----------------  -----------
              302            1
              201            1
              501            1
              402            2
              301            2
              401            2
              403            2
              600            2
              100            3
In the example, because the rows are not repeated to different sample sets, the supply of rows is exhausted before the third set can be completed. This results in a warning that there were not enough rows to populate all samples as requested.