Sunday, 10 November 2013

9.1 Teradata Parallel Transporter - Load Operator - Basics

Load Operator Capabilities:

  • Load Operator is a consumer operator.

  • Load Operator uses Teradata FastLoad protocol to load a large volume of data at high speed into an empty table in Teradata Database.





Load Operation Phases:

Following are the phases involved in loading the Load process.

Acquisition Phase:
Data from the input stream is transmitted to the AMPs. The AMPs 'acquire' the data.
The acquisition phase is not complete until all data sources are exhausted and all data rows are on the appropriate AMPs.
The data is not yet sorted or blocked, and therefore, is not yet accessible.
Application Phase
Data is sorted, blocked, and put into its final format.
All activity in this phase is AMP-local.
Data is accessible after the completion of this phase.






Load Operator Definition:

Following is an example of a Load operator definition.

DEFINE OPERATOR LOADOP
DESCRIPTION 'Teradata PT LOAD OPERATOR'
TYPE LOAD
SCHEMA INPUT_SCHEMA
ATTRIBUTES
(
VARCHAR TdpId,
VARCHAR UserName,
VARCHAR UserPassword,
VARCHAR TargetTable,
VARCHAR LogTable,
VARCHAR ErrorTable1,
VARCHAR ErrorTable2,
VARCHAR PrivateLogName
);


TDPID :
It’s a optional Security attribute that specifies the name of the Teradata database machine(non-mainframe platforms) or TDP(mainframe platforms) for Export operator job.
If we do not specify the value for the TdpId attribute, the operator uses the default TdpID established for the user by the system administrator.

UserName:
Attribute that specifies the user name on the Teradata Database.
UserPassword :
Attribute that specifies the password associated with the user name.
PrivateLogName
Optional attribute that specifies the name of a log that is maintained by the Teradata PT
Logger inside the public log. The private log contains all of the messages produced by the operator.
TargetTable
It’s the target table to be loaded.
The table should be empty and without any secondary or join indexes.

When the target table is NOPI, the restriction that table must be empty does not exist.
LogTable
It is the restart log table.

Restarts are common during staged loading operations. When additional data is available after a job is paused, the job is restarted by submitting a second script that specifies the additional data. The Load operator recognizes the job as a restart, reads the restart log to determine the status of the job, then loads the additional file.

Restarts can also occur following any unexpected error on a database. For example, if a table
runs out of available space during a load operation, the job terminates, the table is paused, and
a checkpoint is recorded in the restart log. Pausing the job in this way allows you to manually
increase the available space, if needed, then restart the job because the load operation can
restart a job from the last checkpoint in the restart log.

(Fastload does not have a logtable )
ErrorTable1

Error Table1 contains most of the errors connected with data and the data environment. The
following types of errors are captured:
•Constraint violations- Records that violate a range or value constraint defined for specific columns of a table.
 
•Unavailable AMP- Records to be written to a non-fallback table about an offline AMP.
 
•Data conversion errors- Records that fail to convert to a specified data type.

Jobs can use the default names of the error tables, or names can be user-specified as an attribute of the Load operator.
ErrorTable2
Error Table2 contains all of the rows that have violations of the unique primary index.
In case of Load Operator This error table is not used when the target table has a non-unique primary index.

(in case of update operator this table can get records from other conditions too)

Jobs can use the default names of the error tables, or names can be user-specified as an attribute of the Load operator.








Auto-generation of Error Tables:

A target table must exist on the Teradata Database when the Load operator job is executed.

However If we have not specified error tables (specifying them is optional), the Load operator automatically creates names of the error tables as follows:

• The first error table is ttname_ET
• The second error table is ttname_UV

where ttname is the name of the corresponding target table.

The value of the TargetTable attribute is truncated to 27 characters before the suffixes "_ET" and "_UV" are appended. This means that if the value of the TargetTable attribute is a fully qualified table name and that fully qualified name exceeds the maximum supported length of a database object (30 characters), the generated names for the error tables may not be what you intend. In such a situation, Teradata recommends that you provide the names of the error tables and not rely on the Load operator to generate the names for these tables automatically.






Duplicate Rows:

Duplicate rows, which are exact duplicates of existing table rows, are never inserted, even if the target table is defined as a multiset table.
 
Duplicate row violations are thus not captured in either Error Table 1 or Error Table 2. Instead, they are counted and reported in the status log at the end of a job.

However If the target table is defined as a NoPI table, duplicate rows are inserted. NoPI tables are inherently multiset since no duplicate row checking is possible; with NoPI tables, duplicate rows can be on different AMPs. Therefore, no duplicate row elimination is performed.

If a table has a unique primary index, a circumstance where there is duplicate row takes precedence over a duplicate primary index error, meaning that the offending row is counted and reported, but it is not captured in Error Table 2.




ErrorLimit attribute:

ErrorLimit attribute is used to specify a limit to the number of errors that can be tolerated before a job is terminated.

This limit, which is specified with the ErrorLimit attribute, represents the total number of errors written to the first error table per instance of the Load operator, not to all instances  combined. Therefore, if an error limit is set to 1,000, a single load instance must detect that 1,000 rows are inserted into the error table before the job is terminated.

The error limit can also be reached at checkpoint time; see the examples below.

Consider these examples if there are two instances running and the Error Limit has been set to 1000.

• If either instance by itself reaches 1000, it will terminate the job by returning a fatal error.
• If instance #1 processes 500 error rows and instance #2 processes 500 error rows but does not reach a checkpoint. The job will continue processing.
• If instance #1 processes 500 error rows and instance #2 processes 500 error rows but does reach a checkpoint. The total number of error rows for all instances combined is determined at checkpoint time and at the end of the Acquisition Phase. If the total of all instances exceeds the error limit at that time, the job will terminate with an error.




Staged Loading:

Staged loading is the ability to pause an active load operation until additional data is available.

Teradata PT can load the first two files, pause the Load operation, and then load the third file when it is available.
Staged loading is set by the attribute PauseAcq = 'Y', which prevents the Load operator from proceeding to the application phase. Each stage of the load operation is accomplished with a separate job script: one for the acquisition phase, and one for the application phase


For example:
To accomplish the scenario with three files (one of which is unavailable until the next day), run Job1 on Day 1 using the two existing files as input to the load, with the PauseAcq = 'Y'setting.

When this stage of the job is finished, the target table is paused and becomes inaccessible to users. Attempts to access the target table (or the error tables) return the following error message:

Operation Not Allowed <tablename> is being loaded

On Day 2, restart the paused job by running Job2 using the third, newly available file as input. For this stage, set PauseAcq = 'N'. When this stage of the job finishes, the table is fully loaded and ready for access.

Note: A paused table, though inaccessible, can be dropped.






Sessions and Instances:

A minimum and a maximum number of sessions (the session limits) can be specified for the Load operator. Both can also be specified for the Export and Update operators.

Consider the following usage notes which apply to all three types of operators.
• The maximum sessions connected can never exceed the number of available AMPs in the system, even if a larger number is specified.
• The default is one session per available AMP.
• For the MinSessions attribute, the minimum specification is one.
• The MaxSessions attribute can be set to a number smaller than the number of AMPs on the database server if fewer sessions are suitable for the job.





Limits on Load jobs:

The Load operator requires one designated Teradata Database loader task.

The number of concurrent load tasks is, however, configurable in the Teradata Database environment using the same MaxLoadTasks field control used by FastLoad, FastExport, and MultiLoad.





Checkpointing and Restarting:

The Load operator takes checkpoints at the beginning and end of the acquisition phase.

More granular checkpoints during the acquisition phase can be specified using the command line option -z when running Teradata PT using the tbuild command. The -z option specifies checkpoint intervals in terms of seconds.

The following command string is an example of the -z option:

tbuild -f <file name> -z 30

In this command, the -f option indicates the script that is input to tbuild, and the -z option indicates that a checkpoint will be taken every 30 seconds.

The DEFINE JOB statement can also be used to specify a checkpoint value.

Checkpointing during the application phase is managed internally by the Teradata Database, and therefore is not user-controlled.





Load Operator as Standalone Operator:

The Load operator can function as a standalone operator and supports an APPLY statement with no SELECT statement and no INSERT statement.

When we use LOAD as a standalone operator, use one of the following definitions:

TYPE LOAD STANDALONE
TYPE LOAD

Note that The STANDALONE keyword is optional.

Example:

DEFINE JOB LOAD_USER_DATA
(
DEFINE OPERATOR LOAD_OPERATOR
TYPE LOAD
(
.
.
.
);
.
.
.
APPLY
TO OPERATOR (LOAD_OPERATOR[1]);
);

Note: Use LOAD as a standalone operator to apply data on the target table without sending more data.


We might use this in a following scenario. Say we have 3 files loaded at separate times in a day. In this case we create 3 TPT scripts each with PauseAcq attribute as 'Y' and then a 4 script with only standalone LOAD operator which triggers the application phase and loads the data actually to the table.

No comments:

Post a comment