Monday 13 May 2013

Teradata Sampling -- Stratified sampling


Sample Function Feature:

Sample allows user to specify a sample of results specified by the SELECT statement.

The sample function has following additional features:

Stratified Sampling
Ability to generate different size samples from different groupings of the data.
Replacement Sampling
Ability to sample a row and return it to the pool of rows and making it candidate for sampling.
Randomized Sampling
Ability to sample randomly across the entire system instead of each AMP returning an equal share of the sample.

For this testing lets consider a table EMPLOYEE2 which has following data distribution:

select        departmentno, count(*)
from        employee2
group        by 1 ;

DepartmentNo
Count(*)
100
7
200
7
300
6
400
8


Stratified sampling:

Following is the example of stratified sampling. It allows us to take 25% samples of each department.

Select        employeeid,departmentno,sampleid
From        employee2
SAMPLE        
WHEN        departmentno=100 then .25
WHEN        departmentno=200 then .25
WHEN        departmentno=300 then .25
WHEN        departmentno=400 then .25

END        ;

Employeeid
DepartmentNo
SampleId
132
100
1
9
100
1
10
200
2
6
200
2
152
300
3
3
300
3
144
400
4
136
400
4

Things to note:

  1. The syntax of SAMPLE is very much similar to CASE statement.  Each condition is specified using WHEN ..THEN and it ends with a END.

  1. Rounding off takes place. For department 300 there are only 6 rows in the table. 25% of that comes to 1.5 which is rounded off to 2 and hence we see 2 rows

Following is another example with different percentages for each sample:

Select        employeeid,departmentno,sampleid
From        employee2
SAMPLE        
WHEN        departmentno=100 then .40
WHEN        departmentno=200 then .25
WHEN        departmentno=300 then .05
WHEN        departmentno=400 then .35

END        ;

Employeeid
DepartmentNo
SampleId
5
100
1
1
100
1
9
100
1
140
200
2
139
200
2
144
400
4
134
400
4
137
400
4

Things to note:

  1. Note that the sample .05 of 6 rows is 0.3 which is not rounded to 1 and hence no row is returned for the department 3.

1 comment: