Friday 29 March 2013

OLAP - Part 9 - SAMPLE and SAMPLEID


Sampling : SAMPLE and SAMPLEID

  1. The SAMPLE function is used to generate samples of data from a table or view.
     
  1. This can be done in two ways:

  • Sample n where n is integer. It will return n number of rows.(Actual number of rows).
If the number n is greater than the number of rows in the table, the sample will consist of the number of rows in the table.
 
  • Sample n where n is a fraction. That will return that much fraction of the table.(.ie Percentage of the table).  n is a decimal value less than 1.00 and greater than .00

Note that calculations resulting in fractional rows must be greater than .5 to actually return a row.

Example: The table employee has 12 rows. .25 sample means 4.2 rows. As the fraction value is not above .5 only 4 rows will be returned.

A .49 samele means 5.88 rows. As .88 is greater than .5 sample of .49 will return 6 rows.


  1. Rows are not reused within the same sample.

If table has 12 rows . A query like following will return only 12 rows as table has only 12 rows.

select * from employee sample 15;


  1. Check the query as below

Select count(distinct departmentNo) from employee sample 5;

We might think that the above query is creating a sample of 5 records and the  find number of distinct department numbers in the sample. However in actual all the rows in the table are considered and number of distinct departments as 4.

To produce correct result we would have to use a derived table as shown below:

Select count(distinct A. departmentno)  from
(Select * from employee sample 5) A;

This query gives result as 3 which is the count of distinct department numbers within the sample.

If we run the same query again and again the result may vary as the sample output may vary every time.

  1. Multiple sample sets may be generated in a single query if desired.

    In order to identify the specific set a tag called the SAMPLEID is made available for association with each set.

The SAMPLEID may be selected, used for ordering, or used as a column in a new table.

Note that When using multiple sets rows are not repeated to different sample sets.

SAMPLEID is a keyword that assigns a sample-id to a specific sample set.

Example:

Select EMPLOYEEID, SAMPLEID from employee sample .5,.25

Employeeid
SampleId
5
1
10
1
11
1
12
1
1
1
6
1
2
2
4
2
7
2

The sampleid indicates which sample the record belongs to. Here 6 rows belong to sample 1 and 3 rows belong to sample 2.

If the sum of sample increases beyond 1 then it results in an error.

Select EMPLOYEEID, SAMPLEID from employee sample .5,.6;

Error message: SELECT Failed. 5473:  SAMPLE clause has invalid set of arguments. 

As mentioned earlier rows are not repeated with in different samples, even if the there are not enough rows to return

The employee table has only 12 rows. Following is the result when we run the below query

Select EMPLOYEEID, SAMPLEID from employee sample 6,5,4;

Employeeid
SampleId
5
1
3
1
10
1
4
1
2
1
1
1
11
2
9
2
7
2
6
2
8
2
12
3

We get only 12 rows as output as there are no enough rows in the table. Note that the last sample does not enough rows to return . Hence we only have one row with sampleid 3.

1 comment:

  1. "If the number n is greater than the number of rows in the table, the sample will consist of the number of rows in the table."

    pls revisit this comment

    ReplyDelete