Sunday, 17 November 2013

11.1 Teradata Parallel Transporter - Stream Operator - Basics

Stream Operator Capabilities:

  • Stream Operator is a Consumer Operator.

  • Stream Operator emulates Teradata Tpump Utility.

  • Stream Operator is used to perform high-speed parallel Inserts, Updates, Deletes, and Upserts in a near-real-time to one or more empty or preexisting Teradata Database tables without locking target tables.
    (note that locking is different from Load,Update operators)

  • Row level locking allows load operations in the background during normal system use.

  • The Stream operator instances can update up to 127 tables on the Teradata Database.

  • Multiple parallel instances can be used to improve the performance of the update.






Stream Operator Definition:

Following is a Stream operator definition.

DEFINE OPERATOR STREAM_LOAD_OPERATOR
DESCRIPTION 'STREAM OPERATOR FOR Teradata PT’
TYPE STREAM
SCHEMA *
ATTRIBUTES
(
VARCHAR TdpId,
VARCHAR UserName,
VARCHAR UserPassword,
VARCHAR LogTable,
VARCHAR ErrorTable,   --> Note that only one error table
VARCHAR AppendErrorTable,
VARCHAR DropErrorTable,
VARCHAR PrivateLogName
);



TDPID :
It’s a optional Security attribute that specifies the name of the Teradata database machine(non-mainframe platforms) or TDP(mainframe platforms) for Stream Operator.
 
If we do not specify the value for the TdpId attribute, the operator uses the default TdpID established for the user by the system administrator.
UserName:
Attribute that specifies the user name on the Teradata Database.
UserPassword :
Attribute that specifies the password associated with the user name.
PrivateLogName
Optional attribute that specifies the name of a log that is maintained by the Teradata PT

Logger inside the public log. The private log contains all of the messages produced by the operator.



LogTable:

A restart log table  contains restart information written during job runs
A Log Table is required for  any execution of the Stream operator.

This is specified using the LogTable attribute.

Restarts can occur following any unexpected error on a database.

For example: If a table runs out of available space during a load operation, the job terminates, the table is paused, and a checkpoint is recorded in the restart log. Pausing the job in this way allows you to manually increase the available space, if needed, then restart the job because the load operation can restart a job from the last checkpoint in the restart log.





ErrorTable:

The ErrorTable is specified using the ErrorTable Attribute. Note that there is only 1 error table.

This table contains information concerning data conversion errors, constraint violations, and other error conditions.

This attribute must be a new table name, unless the AppendErrorTable attribute is set to YES.

If the name is not supplied, it is created by the Stream operator. However note that user supplied names for error tables must not exceed 30 bytes.

By default, the Stream operator drops the error table after data loading, unless the error table is non-empty.
However the error table can be kept after data loading by setting the
DropErrorTable attribute to NO.
Keeping the error table allows the error table to be re-used for future jobs and saves the time required to drop the table.

By default, the Stream Operator will create the error table. If the error table already exists, Stream Operator will terminate the job with an error message. The exception is on a job restart. On a job restart, Stream Operator will continue.

  • When the value of the AppendErrorTable attributes is YES, the Stream operator reuses the error table and displays the number of rows already existing in the error table.

  • If the error table does not exist, the Stream operator creates the error table before data loading and continue.

  • If the structure of the existing error table is not compatible with the error table the Stream operator expects, it terminates the job with an error message before data loading.









Error Limit:

Attribute that specifies the approximate number of records that can be stored in the error table before the Stream operator job is terminated.

This number is approximate because the Stream operator sends multiple rows of data at a time to Teradata Database. By the time Teradata PT processes the message indicating that the error limit has been exceeded, it may have loaded more records into the error table than the actual number specified in the error limit.

Restriction : The value greater than 0.
 
Invalid Value : will cause the Stream operator job to terminate.
 
Default : ErrorLimit value is unlimited.

Note : The ErrorLimit specification applies to each instance of the Stream operator







Using Database Macros:

The Stream operator uses macros to modify tables rather than actual DML statements.

Before beginning a load, the operator creates macros to represent the DML statements.
The macros are then iteratively executed in place of the DML statements.

The Stream operator also removes the macros after all rows are loaded.

The MacroDatabase attribute to specify the database that contains the macros created by the Stream operator.

Note : If not specified, the database used is the database that contains the Restart Log table.







Multiple Statement Requests:

The most important technique used by the Stream operator to improve performance is the multiple statement request. Placing more statements in a single request is beneficial for two reasons.

It reduces:
• Network overhead because large messages are more efficient than small ones.
Robust mode recovery overhead to one extra restart log table row for each request. ( In robust mode a log entry is made per request. Larger the multistatement request less logs need to be made. If one of the SQL in the multistatement request fails the entire request is rolled back)

The Stream operator packs multiple statements into a request based upon the Pack attribute specification.







SERIALIZE:

The Serialize option only applies to the Stream operator.


We use the Serialize option
when correct sequencing of transactions is required.


Using the Serialize option in APPLY statements, the Stream operator ensures that operations for a given row occur in the order they are received from the input stream.  Note that we specify the SERIALIZE option in the APPLY statements.


Restriction:  is that when using the Serialize option, only one instance of the Stream operator is allowed. Specifying more than one instance causes the Stream operator to terminate with an error.




Default:

Default is OFF unless the job contains an upsert operation, which causes Serialize to switch the default to ON.


When the Serialize option is set to OFF, transactions are processed in the order they are encountered, then they are placed in the first available buffer. Buffers are sent to parsing engine (PE) sessions and PEs process the data independently of other PEs. In other words, transactions might occur in any order.



Sequencing Key:

To use this option, associate a sequencing key (usually the primary index) with the target table. Each input data row is hashed based on the key to determine the session assigned to process each input row. This allows all rows with the same key to be processed in sequence by the same session, which is especially important if rows are distributed among many sessions.

If the Serialize option is set to ON, operations on a row occur serially in the order submitted.

The sequencing key of SERIALIZE ON is specified as one or more column names occurring in the input data SCHEMA definition. These SCHEMA columns are collectively referred to as the key.

Usually the key is the primary index of the table being updated, but it can be a different column or set of columns. For example:

APPLY
('UPDATE emp SET dept_name = :dept_name
WHERE empno = :empno;')
SERIALIZE ON (empno)
TO TARGET_TABLE[1]

This APPLY statement guarantees that all data rows with the same key (empno) are applied to the database in the same order received they are received from the producer operator. In this case, the column empno is the primary index of the Emp table.


Note: that SERIALIZE ON is local to a specific DML statement. In the following example, a group DML is specified, but only the first statement uses the Serialize option:

APPLY
('UPDATE emp SET dept_num = :dept_num
WHERE empno = :empno; ')
SERIALIZE ON (empno)
('UPDATE dept SET dept_name = :dept_name
WHERE deptno = :deptno; ')
TO TARGET_TABLE[1]

Following are some of the advantages to using the Serialize option, and might improve performance:

• SERIALIZE ON can eliminate the lock delays or potential deadlocks caused by primary index collisions coming from multiple sessions.
• SERIALIZE ON can also reduce deadlocks when rows with non-unique primary index values are processed.









Robust and Non-Robust Mode:

Robust Mode causes every DML operation to be check pointed and ensures on restart that no operation is applied more than once.

The robust mode requires more writes to a restart log, which might impact performance more, however, using robust mode ensures that a restart avoids reprocessing rows that a normal interval checkpoint might necessitate.

Robust is the default mode for all Stream operator jobs. The Robust attribute turns the mode on or off.

Setting the attribute to “yes” tells the Stream operator to use robust restart logic.

VARCHAR Robust = 'Yes' (or 'Y')

Robust mode causes a row to be written to the log table each time a buffer successfully completes its updates.

Mini-checkpoints are written for each successfully processed row.
These mini-checkpoints are deleted from the log when a checkpoint is taken, and are used at restart to identify the rows that have been successfully processed, which permits them to be bypassed at restart.

The larger the pack factor, the less overhead is involved in this activity.

Choosing the Robust mode is particularly useful to avoid problems with data integrity.

Robust mode is recommended in the following situations to avoid having an adverse affect on restarts:

• INSERTs into multi-set tables - Robust mode prevents the insertion of duplicate rows, which could insert the same row a second time.

• UPDATEs based on calculations- Robust mode prevents the duplicate application of calculations.

• Large Pack factors - Robust mode does not involve the application and rejection of duplicate rows after restarts, which is a time-consuming process of logging errors to the error table.

• Time-stamped data - Robust mode prevents the possibility of stamping identical rows with different time stamps, resulting in duplicate rows.


Setting the attribute to “no” tells the Stream operator to use simple restart logic rather than robust logic.

VARCHAR Robust = 'No' (or 'N')

In a non-robust mode, restarts begin where the last checkpoint occurs in a job. Because some additional processing will most likely take place after the checkpoint is written, the requests that occur after the checkpoint are resubmitted by the Stream operator as part of the restart process.

Re-attempting updates can also be a problem if update calculation, for example, is based on existing data in the row, such as adding 10% to an amount. Doing the update calculation a second time add an additional 10% to the amount, thus compromising data integrity. In this type of update, it is best to use robust mode to ensure that no DML operation is applied more than once.

The non-robust (or simple restart) method does not involve the extra overhead that comes with the additional inserts to the restart log table that are needed for robust logic, so overall processing is notably faster.






Data Quality Affects Performance:

If data contains errors, a large Pack factor can slow performance because of the way Stream handles errors.

For example, if several hundred statements are packed, when an error occurs, the entire request is rolled back. The Stream operator then removes the error-producing statement(s) and then reissues the entire request. Such a process can be costly from a performance standpoint.





Statement Packing

To provide optimal performance, the Stream operator packs individual DML statements into a larger multistatement request based on the rate specified by the Pack attribute. This type of processing requires less overhead than multiple individual requests.

The Pack attribute specifies the number of statements in a multistatement request. Specifying a Pack rate improves network/channel efficiency by reducing the number of sends and receives between Teradata PT and the Teradata Database.

Maximum of 2400 statements can be specified.
 
Trial and error might be required to determine the best Pack rate for a Stream job. As the Pack rate is increased, the throughput improvement is usually great at first, then falls off.
Ex:
Going from a Pack rate of 1 to 2 could provide huge performance gains, and going from 2 to 4 could be just as beneficial, but moving from 8 to 16 might cause a performance drop.


If the PackMaximum attribute is set to 'Yes', the Stream operator determines the maximum pack for the job, and then reports it.

Two factors to consider are:
 
• The maximum Pack factor based on Stream operator restrictions
• The optimal Pack factor for a particular job

These two factors might not be equal. The maximum rate lets you know the upper limit, but performance might improve at a smaller rate. For this reason, it is recommended that PACKMAXIMUM not be used for production jobs until you determine the optimal Pack factor.



Tuning the Pack Factor:

To determine the ideal pack factor to specify in the Pack attribute, first use the PackMaximum attribute by setting it to 'Yes'. Setting this attribute to Yes on the first job run sets up iterative interactions with the Teradata Database to heuristically determine the maximum possible pack factor.

At the end of the run, this value is displayed in the Stream operator’s logged output. Specify that determined value in the Pack attribute on subsequent runs. Set the PackMaximum to 'No'.




Array Support

The Array Support feature allows DML requests containing only a single statement to be executed once for each of multiple rows of input data, each row specified being one of the members of the array. The DML must be contained within an APPLY statement that includes Stream operator and does not set ArraySupport to Off.

Use of array support improves performance by:
• Increasing the amount of data that can be sent per request from a total of approximately 64KB to a total of approximately 1MB, with a limit of 64KB per input data row.



Latency Interval:

Latency is the interval value, expressed in seconds, between the flushing of stale buffers.
Latency interval is an option that is exclusively used by the Stream operator.

In normal operations (without latency), the Stream operator reads data from the data stream until its buffer is full, then it writes all buffered rows to the Teradata Database. The data is written to the Teradata Database only when the buffer is full or when a checkpoint is taken.

However, a latency interval (for example, set to 5400 seconds) causes the following:

• The Stream operator reads data from the data stream, and empties its buffer, writing the contents to the Teradata Database every 90 minutes (5400 seconds) regardless of whether it is full.
 
• If the buffer fills up within the time period (in this case, 90 minutes), it writes to the

Teradata Database as it would during normal operation.

To set the latency interval, use the following syntax:

tbuild -l <LatencyInterval> -f <file name>

The value used for the latency interval must be a non-zero unsigned integer. The guiding factor is how stale you are willing to allow data to be.

For example, to run a continual load script with a latency interval of two hours, enter:
tbuild -l 7200 -f continualload

(this may be used when the rows come in at very slow rate.. That time we can decide maximum no of mins before applying the rows)




Operational Considerations:

NoPI Tables
Operations other than Insert on target tables defined as NoPI are very non-performant, unless there is an appropriate secondary index path to the data rows.


Sessions and Instances:

Because the Stream operator uses Teradata SQL sessions in its communication with Teradata Database, it does not use load tasks. The Stream operator provides continuous updates using row level locking, allowing constant load operations in the background during normal system use.

Both a minimum and a maximum number of sessions can be used by the Stream operator.
The minimum specification is one. The default is one session for each operator instance.
The number of sessions is evenly distributed among the number of operator instances. If 20 sessions are requested and four instances of the operator are invoked, then each instance will receive five of the sessions.


Checkpointing and Restarting:

Checkpoint options control how often a row is written to the checkpoint file for the purposes of restarting a job. Unless otherwise specified, a checkpoint is taken at the start of and at the end of the input data. Since this process does not provide granular restartability in the case of longer running jobs, checkpoint intervals can be user-specified in terms of minutes or seconds.

• Seconds - Use the command line option -z. For example, the following command indicates that a checkpoint will be taken every 30 seconds:

tbuild -f <script file name> -z 30

• Minutes or seconds - Specify as part of a DEFINE JOB statement. For example:

DEFINE JOB test_job
SET CHECKPOINT INTERVAL 30 SECONDS
or
SET CHECKPOINT INTERVAL 10 MINUTES


DML Option Categories

The DML Options can be classified into four logical categories, which can be specified in any order:
• ARRAY SUPPORT
• SERIALIZE
USE
• DML ERROR OPTIONS, which include:
  1. MARK
  1. IGNORE
  1. INSERT FOR

The following restrictions apply:
• Multiple ARRAY SUPPORT Options cannot be specified.
• Multiple SERIALIZE Options cannot be specified.
• Multiple USE Options cannot be specified.
• Multiple DML Error Options (MARK,IGNORE, INSERT FOR) can be specified in one block. However these cannot be interleaved with the other three DML Option categories.


No comments:

Post a Comment