Stream Operator Capabilities:
- Stream Operator is a Consumer Operator.
- The Stream operator emulates the Teradata TPump utility.
- The Stream operator is used to perform high-speed parallel Inserts, Updates, Deletes, and Upserts in near real time against one or more empty or preexisting Teradata Database tables without locking the target tables (note that this locking behavior differs from the Load and Update operators).
- Row level locking allows load operations in the background during normal system use.
- The Stream operator instances can update up to 127 tables on the Teradata Database.
- Multiple parallel instances can be used to improve the performance of the update.
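For orientation, the following is a minimal job sketch (the schema, producer operator, table, and column names are hypothetical) showing the Stream operator consuming rows produced by another operator:
DEFINE JOB STREAM_LOAD_JOB
DESCRIPTION 'Load the emp table using the Stream operator'
(
    /* DEFINE SCHEMA and the producer/consumer operator definitions would appear here */
    APPLY
    ('INSERT INTO emp (empno, dept_num) VALUES (:empno, :dept_num);')
    TO OPERATOR (STREAM_LOAD_OPERATOR[2])      /* two parallel consumer instances */
    SELECT * FROM OPERATOR (FILE_READER_OPERATOR[2]);
);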
Stream Operator Definition:
Following is a Stream operator definition:
DEFINE OPERATOR STREAM_LOAD_OPERATOR
DESCRIPTION 'STREAM OPERATOR FOR Teradata PT'
TYPE STREAM
SCHEMA *
ATTRIBUTES
(
    VARCHAR TdpId,
    VARCHAR UserName,
    VARCHAR UserPassword,
    VARCHAR LogTable,
    VARCHAR ErrorTable,        /* note: the Stream operator has only one error table */
    VARCHAR AppendErrorTable,
    VARCHAR DropErrorTable,
    VARCHAR PrivateLogName
);
TdpId:
Optional security attribute that specifies the name of the Teradata Database machine (non-mainframe platforms) or TDP (mainframe platforms) for the Stream operator. If no value is specified for the TdpId attribute, the operator uses the default TdpId established for the user by the system administrator.
UserName:
Attribute that specifies the user name on the Teradata Database.
UserPassword:
Attribute that specifies the password associated with the user name.
PrivateLogName:
Optional attribute that specifies the name of a log that is maintained by the Teradata PT Logger inside the public log. The private log contains all of the messages produced by the operator.
LogTable:
A restart log table contains restart information written during job runs. A log table is required for any execution of the Stream operator and is specified using the LogTable attribute.
Restarts can occur following any unexpected error on the database. For example, if a table runs out of available space during a load operation, the job terminates, the table is paused, and a checkpoint is recorded in the restart log. Pausing the job in this way allows you to manually increase the available space, if needed, and then restart the job, because the load operation can restart from the last checkpoint in the restart log.
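As a sketch, the attributes described above might be assigned values directly in the operator definition; the system name, credentials, and table names below are hypothetical:
DEFINE OPERATOR STREAM_LOAD_OPERATOR
DESCRIPTION 'STREAM OPERATOR FOR Teradata PT'
TYPE STREAM
SCHEMA *
ATTRIBUTES
(
    VARCHAR TdpId          = 'tdprod',                   /* hypothetical Teradata system name */
    VARCHAR UserName       = 'tpt_user',                 /* hypothetical credentials */
    VARCHAR UserPassword   = 'tpt_password',
    VARCHAR LogTable       = 'tpt_work.emp_restartlog',  /* restart log table */
    VARCHAR PrivateLogName = 'stream_private_log'
);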
ErrorTable:
The error table is specified using the ErrorTable attribute. Note that the Stream operator has only one error table.
This table contains information concerning data conversion errors, constraint violations, and other error conditions. The attribute must name a new table, unless the AppendErrorTable attribute is set to YES. If a name is not supplied, the Stream operator creates one; note that user-supplied names for error tables must not exceed 30 bytes.
By default, the Stream operator drops the error table after data loading, unless the error table is not empty. The error table can be kept after data loading by setting the DropErrorTable attribute to NO. Keeping the error table allows it to be reused for future jobs and saves the time required to drop the table.
By default, the Stream operator creates the error table. If the error table already exists, the Stream operator terminates the job with an error message. The exception is a job restart, in which case the Stream operator continues.
- When the value of the AppendErrorTable attribute is YES, the Stream operator reuses the error table and displays the number of rows already existing in the error table.
- If the error table does not exist, the Stream operator creates it before data loading and continues.
- If the structure of the existing error table is not compatible with the error table the Stream operator expects, the job is terminated with an error message before data loading.
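As an illustrative sketch only (the table name is hypothetical), these attributes could be set as follows to reuse and keep an error table across runs:
    VARCHAR ErrorTable       = 'tpt_work.emp_errors',   /* the single error table for the job */
    VARCHAR AppendErrorTable = 'Yes',                    /* reuse a compatible existing error table */
    VARCHAR DropErrorTable   = 'No'                      /* keep the error table after data loading */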
ErrorLimit:
Attribute that specifies the approximate number of records that can be stored in the error table before the Stream operator job is terminated.
This number is approximate because the Stream operator sends multiple rows of data at a time to the Teradata Database. By the time Teradata PT processes the message indicating that the error limit has been exceeded, it may have loaded more records into the error table than the number specified in the error limit.
Restriction: the value must be greater than 0.
Invalid value: an invalid value causes the Stream operator job to terminate.
Default: the ErrorLimit value is unlimited.
Note: the ErrorLimit specification applies to each instance of the Stream operator.
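For instance, a job might cap errors at an illustrative threshold by adding the attribute to the operator definition:
    INTEGER ErrorLimit = 1000   /* terminate the job after roughly 1000 error rows per instance */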
Using Database Macros:
The Stream operator uses macros to modify tables rather than actual DML statements. Before beginning a load, the operator creates macros to represent the DML statements. The macros are then iteratively executed in place of the DML statements. The Stream operator also removes the macros after all rows are loaded.
Use the MacroDatabase attribute to specify the database that contains the macros created by the Stream operator. If it is not specified, the database used is the one that contains the restart log table.
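A minimal sketch, assuming a hypothetical working database named tpt_work:
    VARCHAR MacroDatabase = 'tpt_work'   /* database in which the Stream operator creates its macros */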
Multiple Statement Requests:
The most important technique used by the Stream operator to improve performance is the multiple statement request. Placing more statements in a single request is beneficial for two reasons. It reduces:
• Network overhead, because large messages are more efficient than small ones.
• Robust mode recovery overhead, to one extra restart log table row for each request. (In robust mode a log entry is made per request, so the larger the multistatement request, the fewer log entries need to be made. If one of the SQL statements in the multistatement request fails, the entire request is rolled back.)
The Stream operator packs multiple statements into a request based upon the Pack attribute specification.
SERIALIZE:
The Serialize option applies only to the Stream operator. Use the Serialize option when correct sequencing of transactions is required. When the SERIALIZE option is specified in an APPLY statement, the Stream operator ensures that operations for a given row occur in the order they are received from the input stream.
Restriction: when using the Serialize option, only one instance of the Stream operator is allowed. Specifying more than one instance causes the Stream operator to terminate with an error.
Default: Serialize is OFF, unless the job contains an upsert operation, which switches the default to ON.
When the Serialize option is set to OFF, transactions are processed in the order they are encountered and placed in the first available buffer. Buffers are sent to parsing engine (PE) sessions, and each PE processes its data independently of the other PEs. In other words, transactions might occur in any order.
Sequencing Key:
To use this option, associate a sequencing key (usually the primary index) with the target table. Each input data row is hashed based on the key to determine the session assigned to process it. This allows all rows with the same key to be processed in sequence by the same session, which is especially important if rows are distributed among many sessions.
If the Serialize option is set to ON, operations on a row occur serially in the order submitted.
The sequencing key of SERIALIZE ON is specified as one or more column names occurring in the input data SCHEMA definition. These SCHEMA columns are collectively referred to as the key. Usually the key is the primary index of the table being updated, but it can be a different column or set of columns. For example:
APPLY
('UPDATE emp SET dept_name = :dept_name
  WHERE empno = :empno;')
SERIALIZE ON (empno)
TO TARGET_TABLE[1]
This APPLY statement guarantees that all data rows with the same key (empno) are applied to the database in the same order they are received from the producer operator. In this case, the column empno is the primary index of the emp table.
Note that SERIALIZE ON is local to a specific DML statement. In the following example, a group DML is specified, but only the first statement uses the Serialize option:
APPLY
('UPDATE emp SET dept_num = :dept_num
  WHERE empno = :empno;')
SERIALIZE ON (empno)
('UPDATE dept SET dept_name = :dept_name
  WHERE deptno = :deptno;')
TO TARGET_TABLE[1]
Following are some advantages of using the Serialize option, which might improve performance:
• SERIALIZE ON can eliminate the lock delays or potential deadlocks caused by primary index collisions coming from multiple sessions.
• SERIALIZE ON can also reduce deadlocks when rows with non-unique primary index values are processed.
Robust and Non-Robust Mode:
Robust mode causes every DML operation to be checkpointed and ensures on restart that no operation is applied more than once. Robust mode requires more writes to the restart log, which might impact performance; however, using robust mode ensures that a restart avoids reprocessing rows that a normal interval checkpoint might otherwise necessitate.
Robust is the default mode for all Stream operator jobs. The Robust attribute turns the mode on or off. Setting the attribute to 'Yes' tells the Stream operator to use robust restart logic.
VARCHAR Robust = 'Yes' (or 'Y')
Robust mode causes a row to be written to the log table each time a buffer successfully completes its updates. Mini-checkpoints are written for each successfully processed row. These mini-checkpoints are deleted from the log when a checkpoint is taken and are used at restart to identify the rows that have already been successfully processed, which permits them to be bypassed at restart. The larger the Pack factor, the less overhead is involved in this activity.
Choosing Robust mode is particularly useful for avoiding problems with data integrity. Robust mode is recommended in the following situations to avoid an adverse effect on restarts:
• INSERTs into multiset tables - Robust mode prevents the insertion of duplicate rows, since a restart could otherwise insert the same row a second time.
• UPDATEs based on calculations - Robust mode prevents the duplicate application of calculations.
• Large Pack factors - Robust mode does not involve the application and rejection of duplicate rows after restarts, which is a time-consuming process of logging errors to the error table.
• Time-stamped data - Robust mode prevents the possibility of stamping identical rows with different timestamps, resulting in duplicate rows.
Setting the attribute to 'No' tells the Stream operator to use simple restart logic rather than robust logic.
VARCHAR Robust = 'No' (or 'N')
In non-robust mode, restarts begin where the last checkpoint occurs in a job. Because some additional processing will most likely have taken place after the checkpoint was written, the requests that occur after the checkpoint are resubmitted by the Stream operator as part of the restart process.
Re-attempting updates can be a problem if an update calculation is based on existing data in the row, such as adding 10% to an amount. Doing the update calculation a second time adds an additional 10% to the amount, thus compromising data integrity. For this type of update, it is best to use robust mode to ensure that no DML operation is applied more than once.
The non-robust (or simple restart) method does not involve the extra overhead that comes with the additional inserts to the restart log table needed for robust logic, so overall processing is notably faster.
Data Quality Affects Performance:
If the data contains errors, a large Pack factor can slow performance because of the way the Stream operator handles errors.
For example, if several hundred statements are packed and an error occurs, the entire request is rolled back. The Stream operator then removes the error-producing statement(s) and reissues the entire request. Such a process can be costly from a performance standpoint.
Statement Packing:
To provide optimal performance, the Stream operator packs individual DML statements into a larger multistatement request, based on the rate specified by the Pack attribute. This type of processing requires less overhead than multiple individual requests.
The Pack attribute specifies the number of statements in a multistatement request. Specifying a Pack rate improves network/channel efficiency by reducing the number of sends and receives between Teradata PT and the Teradata Database. A maximum of 2400 statements can be specified.
Trial and error might be required to determine the best Pack rate for a Stream job. As the Pack rate is increased, the throughput improvement is usually great at first, then falls off.
For example, going from a Pack rate of 1 to 2 could provide huge performance gains, and going from 2 to 4 could be just as beneficial, but moving from 8 to 16 might cause a performance drop.
If the PackMaximum attribute is set to 'Yes', the Stream operator determines the maximum pack factor for the job and then reports it.
Two factors to consider are:
• The maximum Pack factor based on Stream operator restrictions
• The optimal Pack factor for a particular job
These two factors might not be equal. The maximum rate lets you know the upper limit, but performance might improve at a smaller rate. For this reason, it is recommended that PackMaximum not be used for production jobs until you determine the optimal Pack factor.
Tuning the Pack Factor:
To determine the ideal pack factor to specify in the Pack attribute, first set the PackMaximum attribute to 'Yes'. Setting this attribute to 'Yes' on the first job run sets up iterative interactions with the Teradata Database to heuristically determine the maximum possible pack factor. At the end of the run, this value is displayed in the Stream operator's logged output. Specify the determined value in the Pack attribute on subsequent runs, and set PackMaximum to 'No'.
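As a sketch of this tuning flow (the value 600 is purely illustrative), the first run discovers the maximum and later runs pin the chosen rate:
    /* first run: let the operator determine and report the maximum pack factor */
    VARCHAR PackMaximum = 'Yes'
    /* subsequent runs: use the value reported in the log, tuned downward if needed */
    INTEGER Pack        = 600,
    VARCHAR PackMaximum = 'No'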
Array Support:
The Array Support feature allows a DML request containing only a single statement to be executed once for each of multiple rows of input data, each specified row being one member of the array. The DML must be contained within an APPLY statement that uses the Stream operator and does not set ArraySupport to Off.
Use of array support improves performance by:
• Increasing the amount of data that can be sent per request from a total of approximately 64 KB to a total of approximately 1 MB, with a limit of 64 KB per input data row.
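A minimal sketch, assuming the operator-level attribute form (array support can also be controlled per DML group in the APPLY statement):
    VARCHAR ArraySupport = 'On'   /* execute single-statement DML requests against arrays of input rows */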
Latency Interval:
Latency is an interval value, expressed in seconds, between flushes of stale buffers. The latency interval is an option used exclusively by the Stream operator.
In normal operation (without latency), the Stream operator reads data from the data stream until its buffer is full, then writes all buffered rows to the Teradata Database. The data is written to the Teradata Database only when the buffer is full or when a checkpoint is taken.
However, a latency interval (for example, set to 5400 seconds) causes the following:
• The Stream operator reads data from the data stream and empties its buffer, writing the contents to the Teradata Database every 90 minutes (5400 seconds), regardless of whether the buffer is full.
• If the buffer fills up within the time period (in this case, 90 minutes), it is written to the Teradata Database as it would be during normal operation.
To set the latency interval, use the following syntax:
tbuild -l <LatencyInterval> -f <file name>
The value used for the latency interval must be a non-zero unsigned integer. The guiding factor is how stale you are willing to allow the data to be.
For example, to run a continual load script with a latency interval of two hours, enter:
tbuild -l 7200 -f continualload
(This is useful when rows arrive at a very slow rate; the latency interval sets the maximum amount of time to wait before applying the buffered rows.)
Operational Considerations:
NoPI Tables:
Operations other than Insert on target tables defined as NoPI perform very poorly unless there is an appropriate secondary index path to the data rows.
Sessions and Instances:
Because the Stream operator uses Teradata SQL sessions to communicate with the Teradata Database, it does not use load tasks. The Stream operator provides continuous updates using row-level locking, allowing constant load operations in the background during normal system use.
Both a minimum and a maximum number of sessions can be specified for the Stream operator. The minimum specification is one. The default is one session for each operator instance.
The number of sessions is evenly distributed among the number of operator instances. If 20 sessions are requested and four instances of the operator are invoked, then each instance receives five of the sessions.
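As a sketch, the session range is typically controlled with the MaxSessions and MinSessions attributes, and the instance count is requested in the APPLY statement; the values and names below are illustrative only:
    /* in the operator definition */
    INTEGER MaxSessions = 20,
    INTEGER MinSessions = 4
    /* in the APPLY statement, request four instances of the operator */
    APPLY ('INSERT INTO emp (empno, dept_num) VALUES (:empno, :dept_num);')
    TO OPERATOR (STREAM_LOAD_OPERATOR[4])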
Checkpointing and Restarting:
Checkpoint options control how often a row is written to the checkpoint file for the purpose of restarting a job. Unless otherwise specified, a checkpoint is taken at the start and at the end of the input data. Since this does not provide granular restartability for longer-running jobs, checkpoint intervals can be user-specified in terms of minutes or seconds.
• Seconds - Use the command line option -z. For example, the following command indicates that a checkpoint will be taken every 30 seconds:
tbuild -f <script file name> -z 30
• Minutes or seconds - Specify as part of a DEFINE JOB statement. For example:
DEFINE JOB test_job
SET CHECKPOINT INTERVAL 30 SECONDS
or
SET CHECKPOINT INTERVAL 10 MINUTES
DML Option Categories:
The DML options can be classified into four logical categories, which can be specified in any order:
• ARRAY SUPPORT
• SERIALIZE
• USE
• DML ERROR OPTIONS, which include:
- MARK
- IGNORE
- INSERT FOR
The following restrictions apply:
• Multiple ARRAY SUPPORT options cannot be specified.
• Multiple SERIALIZE options cannot be specified.
• Multiple USE options cannot be specified.
• Multiple DML error options (MARK, IGNORE, INSERT FOR) can be specified in one block. However, these cannot be interleaved with the other three DML option categories.
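As an illustrative sketch only (table and column names are hypothetical), several of these options might appear together in one APPLY statement, with the DML error options grouped in a single block; here the UPDATE/INSERT pair forms an upsert:
APPLY
('UPDATE emp SET dept_num = :dept_num WHERE empno = :empno;',
 'INSERT INTO emp (empno, dept_num) VALUES (:empno, :dept_num);')
SERIALIZE ON (empno)
IGNORE DUPLICATE INSERT ROWS
INSERT FOR MISSING UPDATE ROWS
TO OPERATOR (STREAM_LOAD_OPERATOR[1])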