Showing posts with label SAMPLEID. Show all posts
Showing posts with label SAMPLEID. Show all posts

Monday, 13 May 2013

Teradata Sampling - Randomized sampling


Sampling with randomized allocation:

Sampling with randomized allocation provides ability to sample randomly across the entire system instead of each AMP returning an equal share of the sample.

In most cases, the difference in the data sampled with randomized versus non-randomized sampling will not be immediately apparent.

Following is how we request randomized allocation:

Select        employeeid,departmentno,sampleid
From        employee2
SAMPLE WITH REPLACEMENT RANDOMIZED ALLOCATION
WHEN        departmentno=100 then 4,2
WHEN        departmentno=300 then 4,4
END        ;

Teradata Sampling-- SAMPLE WITH REPLACEMENT


Sample with Replacement:

Assume a table employee with just 14 rows in it.
Now if we execute the following query we will still get result of only 14 rows.

select        *
from        employee
sample         50;

Even though we have asked for sample of 50 rows , because the table has only 14 rows we will get only 14 rows in return. This is because SAMPLE by default does not allow duplication of the rows. Once the row is returned then it is taken out of the pool of rows.

However if we use the WITH REPLACEMENT option we get ability to sample the row and return it back to the pool of rows from which it came, making it candidate for samplings.
Ex:

select        *
from        employee
sample        with replacement 50;

Above query will return 50 rows even if it has to repeat the same rows again and again.

Note that here we write 'SAMPLE WITH REPLACEMENT 50' and not 'SAMPLE 50 WITH REPLACEMENT'.

We can combine stratified sampling and sampling with replacement as follows:

Ex 1:

Select        employeeid,departmentno,sampleid
From        employee2
SAMPLE WITH REPLACEMENT        
WHEN        departmentno=100 then .40
WHEN        departmentno=200 then .25
WHEN        departmentno=300 then .88
WHEN        departmentno=400 then .35


END        ;


Employeeid
DepartmentNo
SampleId
1
100
1
133
100
1
130
100
1
10
200
2
6
200
2
152
300
3
150
300
3
7
300
3
3
300
3
151
300
3
144
400
4
4
400
4
135
400
4


Ex 2:

Select        employeeid,departmentno,sampleid
From        employee2
SAMPLE WITH REPLACEMENT        
WHEN        departmentno=100 then 4,2
WHEN        departmentno=300 then 4,4
END        ;

Employeeid
DepartmentNo
SampleId
131
100
1
5
100
1
1
100
1
130
100
1
131
100
2
1
100
2
151
300
3
7
300
3
3
300
3
150
300
3
150
300
4
3
300
4
151
300
4
7
300
4

We can see that multiple rows are duplicated across samples.
Same row appears in sample 3 and sample 4.


Teradata Sampling -- Stratified sampling


Sample Function Feature:

Sample allows user to specify a sample of results specified by the SELECT statement.

The sample function has following additional features:

Stratified Sampling
Ability to generate different size samples from different groupings of the data.
Replacement Sampling
Ability to sample a row and return it to the pool of rows and making it candidate for sampling.
Randomized Sampling
Ability to sample randomly across the entire system instead of each AMP returning an equal share of the sample.

For this testing lets consider a table EMPLOYEE2 which has following data distribution:

select        departmentno, count(*)
from        employee2
group        by 1 ;

DepartmentNo
Count(*)
100
7
200
7
300
6
400
8


Stratified sampling:

Following is the example of stratified sampling. It allows us to take 25% samples of each department.

Select        employeeid,departmentno,sampleid
From        employee2
SAMPLE        
WHEN        departmentno=100 then .25
WHEN        departmentno=200 then .25
WHEN        departmentno=300 then .25
WHEN        departmentno=400 then .25

END        ;

Employeeid
DepartmentNo
SampleId
132
100
1
9
100
1
10
200
2
6
200
2
152
300
3
3
300
3
144
400
4
136
400
4

Things to note:

  1. The syntax of SAMPLE is very much similar to CASE statement.  Each condition is specified using WHEN ..THEN and it ends with a END.

  1. Rounding off takes place. For department 300 there are only 6 rows in the table. 25% of that comes to 1.5 which is rounded off to 2 and hence we see 2 rows

Following is another example with different percentages for each sample:

Select        employeeid,departmentno,sampleid
From        employee2
SAMPLE        
WHEN        departmentno=100 then .40
WHEN        departmentno=200 then .25
WHEN        departmentno=300 then .05
WHEN        departmentno=400 then .35

END        ;

Employeeid
DepartmentNo
SampleId
5
100
1
1
100
1
9
100
1
140
200
2
139
200
2
144
400
4
134
400
4
137
400
4

Things to note:

  1. Note that the sample .05 of 6 rows is 0.3 which is not rounded to 1 and hence no row is returned for the department 3.

Friday, 29 March 2013

OLAP - Part 9 - SAMPLE and SAMPLEID


Sampling : SAMPLE and SAMPLEID

  1. The SAMPLE function is used to generate samples of data from a table or view.
     
  1. This can be done in two ways:

  • Sample n where n is integer. It will return n number of rows.(Actual number of rows).
If the number n is greater than the number of rows in the table, the sample will consist of the number of rows in the table.
 
  • Sample n where n is a fraction. That will return that much fraction of the table.(.ie Percentage of the table).  n is a decimal value less than 1.00 and greater than .00

Note that calculations resulting in fractional rows must be greater than .5 to actually return a row.

Example: The table employee has 12 rows. .25 sample means 4.2 rows. As the fraction value is not above .5 only 4 rows will be returned.

A .49 samele means 5.88 rows. As .88 is greater than .5 sample of .49 will return 6 rows.


  1. Rows are not reused within the same sample.

If table has 12 rows . A query like following will return only 12 rows as table has only 12 rows.

select * from employee sample 15;


  1. Check the query as below

Select count(distinct departmentNo) from employee sample 5;

We might think that the above query is creating a sample of 5 records and the  find number of distinct department numbers in the sample. However in actual all the rows in the table are considered and number of distinct departments as 4.

To produce correct result we would have to use a derived table as shown below:

Select count(distinct A. departmentno)  from
(Select * from employee sample 5) A;

This query gives result as 3 which is the count of distinct department numbers within the sample.

If we run the same query again and again the result may vary as the sample output may vary every time.

  1. Multiple sample sets may be generated in a single query if desired.

    In order to identify the specific set a tag called the SAMPLEID is made available for association with each set.

The SAMPLEID may be selected, used for ordering, or used as a column in a new table.

Note that When using multiple sets rows are not repeated to different sample sets.

SAMPLEID is a keyword that assigns a sample-id to a specific sample set.

Example:

Select EMPLOYEEID, SAMPLEID from employee sample .5,.25

Employeeid
SampleId
5
1
10
1
11
1
12
1
1
1
6
1
2
2
4
2
7
2

The sampleid indicates which sample the record belongs to. Here 6 rows belong to sample 1 and 3 rows belong to sample 2.

If the sum of sample increases beyond 1 then it results in an error.

Select EMPLOYEEID, SAMPLEID from employee sample .5,.6;

Error message: SELECT Failed. 5473:  SAMPLE clause has invalid set of arguments. 

As mentioned earlier rows are not repeated with in different samples, even if the there are not enough rows to return

The employee table has only 12 rows. Following is the result when we run the below query

Select EMPLOYEEID, SAMPLEID from employee sample 6,5,4;

Employeeid
SampleId
5
1
3
1
10
1
4
1
2
1
1
1
11
2
9
2
7
2
6
2
8
2
12
3

We get only 12 rows as output as there are no enough rows in the table. Note that the last sample does not enough rows to return . Hence we only have one row with sampleid 3.