Teradata FAQ's: SAMPLEID

Showing posts with label SAMPLEID. Show all posts

Monday, 13 May 2013

Teradata Sampling - Randomized sampling

Sampling with randomized allocation:

Sampling with randomized allocation provides ability to sample randomly across the entire system instead of each AMP returning an equal share of the sample.

In most cases, the difference in the data sampled with randomized versus non-randomized sampling will not be immediately apparent.

Following is how we request randomized allocation:

Select employeeid,departmentno,sampleid

From employee2

SAMPLE WITH REPLACEMENT RANDOMIZED ALLOCATION

WHEN departmentno=100 then 4,2

WHEN departmentno=300 then 4,4

END ;

Teradata Sampling-- SAMPLE WITH REPLACEMENT

Sample with Replacement:

Assume a table employee with just 14 rows in it.

Now if we execute the following query we will still get result of only 14 rows.

select *

from employee

sample 50;

Even though we have asked for sample of 50 rows , because the table has only 14 rows we will get only 14 rows in return. This is because SAMPLE by default does not allow duplication of the rows. Once the row is returned then it is taken out of the pool of rows.

However if we use the WITH REPLACEMENT option we get ability to sample the row and return it back to the pool of rows from which it came, making it candidate for samplings.

Ex:

select *

from employee

sample with replacement 50;

Above query will return 50 rows even if it has to repeat the same rows again and again.

Note that here we write 'SAMPLE WITH REPLACEMENT 50' and not 'SAMPLE 50 WITH REPLACEMENT'.

We can combine stratified sampling and sampling with replacement as follows:

Ex 1:

Select employeeid,departmentno,sampleid

From employee2

SAMPLE WITH REPLACEMENT

WHEN departmentno=100 then .40

WHEN departmentno=200 then .25

WHEN departmentno=300 then .88

WHEN departmentno=400 then .35

END ;

Employeeid	DepartmentNo	SampleId
1	100	1
133	100	1
130	100	1
10	200	2
6	200	2
152	300	3
150	300	3
7	300	3
3	300	3
151	300	3
144	400	4
4	400	4
135	400	4

Ex 2:

Select employeeid,departmentno,sampleid

From employee2

SAMPLE WITH REPLACEMENT

WHEN departmentno=100 then 4,2

WHEN departmentno=300 then 4,4

END ;

Employeeid	DepartmentNo	SampleId
131	100	1
5	100	1
1	100	1
130	100	1
131	100	2
1	100	2
151	300	3
7	300	3
3	300	3
150	300	3
150	300	4
3	300	4
151	300	4
7	300	4

We can see that multiple rows are duplicated across samples.

Same row appears in sample 3 and sample 4.

Teradata Sampling -- Stratified sampling

Sample Function Feature:

Sample allows user to specify a sample of results specified by the SELECT statement.

The sample function has following additional features:

Stratified Sampling	Ability to generate different size samples from different groupings of the data.
Replacement Sampling	Ability to sample a row and return it to the pool of rows and making it candidate for sampling.
Randomized Sampling	Ability to sample randomly across the entire system instead of each AMP returning an equal share of the sample.

For this testing lets consider a table EMPLOYEE2 which has following data distribution:

select departmentno, count(*)

from employee2

group by 1 ;

DepartmentNo	Count(*)
100	7
200	7
300	6
400	8

Stratified sampling:

Following is the example of stratified sampling. It allows us to take 25% samples of each department.

Select employeeid,departmentno,sampleid

From employee2

SAMPLE

WHEN departmentno=100 then .25

WHEN departmentno=200 then .25

WHEN departmentno=300 then .25

WHEN departmentno=400 then .25

END ;

Employeeid	DepartmentNo	SampleId
132	100	1
9	100	1
10	200	2
6	200	2
152	300	3
3	300	3
144	400	4
136	400	4

Things to note:

The syntax of SAMPLE is very much similar to CASE statement. Each condition is specified using WHEN ..THEN and it ends with a END.

Rounding off takes place. For department 300 there are only 6 rows in the table. 25% of that comes to 1.5 which is rounded off to 2 and hence we see 2 rows

Following is another example with different percentages for each sample:

Select employeeid,departmentno,sampleid

From employee2

SAMPLE

WHEN departmentno=100 then .40

WHEN departmentno=200 then .25

WHEN departmentno=300 then .05

WHEN departmentno=400 then .35

END ;

Employeeid	DepartmentNo	SampleId
5	100	1
1	100	1
9	100	1
140	200	2
139	200	2
144	400	4
134	400	4
137	400	4

Things to note:

Note that the sample .05 of 6 rows is 0.3 which is not rounded to 1 and hence no row is returned for the department 3.

Friday, 29 March 2013

OLAP - Part 9 - SAMPLE and SAMPLEID

Sampling : SAMPLE and SAMPLEID

The SAMPLE function is used to generate samples of data from a table or view.

This can be done in two ways:

Sample n where n is integer. It will return n number of rows.(Actual number of rows).

If the number n is greater than the number of rows in the table, the sample will consist of the number of rows in the table.

Sample n where n is a fraction. That will return that much fraction of the table.(.ie Percentage of the table). n is a decimal value less than 1.00 and greater than .00

Note that calculations resulting in fractional rows must be greater than .5 to actually return a row.

Example: The table employee has 12 rows. .25 sample means 4.2 rows. As the fraction value is not above .5 only 4 rows will be returned.

A .49 samele means 5.88 rows. As .88 is greater than .5 sample of .49 will return 6 rows.

Rows are not reused within the same sample.

If table has 12 rows . A query like following will return only 12 rows as table has only 12 rows.

select * from employee sample 15;

Check the query as below

Select count(distinct departmentNo) from employee sample 5;

We might think that the above query is creating a sample of 5 records and the find number of distinct department numbers in the sample. However in actual all the rows in the table are considered and number of distinct departments as 4.

To produce correct result we would have to use a derived table as shown below:

Select count(distinct A. departmentno) from

(Select * from employee sample 5) A;

This query gives result as 3 which is the count of distinct department numbers within the sample.

If we run the same query again and again the result may vary as the sample output may vary every time.

Multiple sample sets may be generated in a single query if desired.

In order to identify the specific set a tag called the SAMPLEID is made available for association with each set.

The SAMPLEID may be selected, used for ordering, or used as a column in a new table.

Note that When using multiple sets rows are not repeated to different sample sets.

SAMPLEID is a keyword that assigns a sample-id to a specific sample set.

Example:

Select EMPLOYEEID, SAMPLEID from employee sample .5,.25

Employeeid	SampleId
5	1
10	1
11	1
12	1
1	1
6	1
2	2
4	2
7	2

The sampleid indicates which sample the record belongs to. Here 6 rows belong to sample 1 and 3 rows belong to sample 2.

If the sum of sample increases beyond 1 then it results in an error.

Select EMPLOYEEID, SAMPLEID from employee sample .5,.6;

Error message: SELECT Failed. 5473: SAMPLE clause has invalid set of arguments.

As mentioned earlier rows are not repeated with in different samples, even if the there are not enough rows to return

The employee table has only 12 rows. Following is the result when we run the below query

Select EMPLOYEEID, SAMPLEID from employee sample 6,5,4;

Employeeid	SampleId
5	1
3	1
10	1
4	1
2	1
1	1
11	2
9	2
7	2
6	2
8	2
12	3

We get only 12 rows as output as there are no enough rows in the table. Note that the last sample does not enough rows to return . Hence we only have one row with sampleid 3.

Teradata FAQ's

links

Blog Archive

Monday, 13 May 2013

Teradata Sampling - Randomized sampling

Teradata Sampling-- SAMPLE WITH REPLACEMENT

Teradata Sampling -- Stratified sampling

Friday, 29 March 2013

OLAP - Part 9 - SAMPLE and SAMPLEID