Sunday, 22 September 2013

5.1 Teradata Parallel Transporter - Data Connector Operator

DataConnector capabilities in brief:

Data Connector Operator can play Dual role :

  • It can act as Producer Operator to read data from file or from access module.
  • It can also act as Consumer Operator to write data to flat files or access module.

Data Connector can also used for scanning directories.

We can use multiple instances of the operator.




DataConnector Operator as Producer:

Data connector can read files in multiple ways as floows:

  1. Read a single flat file by specifying a directory path and file name.

  1. Read a directory of flat files by specifying a directory path and a wildcard character (*) with the file name. The wildcard character allows a directory that contains hundreds of files to be treated as a single data source.

  1. Read data using the supported access modules

Following is the example of the DataConnector Producer:


DEFINE DATACONNECTOR OPERATOR
DESCRIPTION 'DataConnector'
TYPE DATACONNECTOR PRODUCER --> This is mandatory to make the operator as Data Connector producer.
SCHEMA PRODUCT_SOURCE_SCHEMA
ATTRIBUTES
(
VARCHAR AccessModuleName,
VARCHAR AccessModuleInitStr,
VARCHAR FileName,
VARCHAR Format,
VARCHAR OpenMode,
VARCHAR TextDelimiter,
VARCHAR ArchiveDirectoryPath,
VARCHAR DirectoryPath,
VARCHAR PrivateLogName,
VARCHAR VigilElapsedTime,
VARCHAR VigilWaitTime,
VARCHAR VigilStartTime,
VARCHAR VigilStopTime,
VARCHAR VigilSortField
);


Depending upon the Way we are using Data Connector Producer the key attributes differ.

Operator Usage
Key Attributes
Data Connector Operator  as producer without Access Module
• FileList
• FileName
• Format
• MultipleReaders
• OpenMode
• Te x t D e limit e r
• PrivateLogName
Data Connector Operator  as producer with Access Module
• AccessModuleName
• AccessModuleInitStr
• FileName
• Format
• OpenMode
• Te x t D e limit e r
• PrivateLogName
Data Connector Operator as producer using Batch Scan.
Batch Scan: The job runs on a specific time or is manually run and it reads all the files in a given directory that satisfy the filename we give.
• ArchiveDirectoryPath
• DirectoryPath
• EnableScan
• FileName
• Format
• OpenMode
• Te x t D e limit e r
• PrivateLogName
Data Connector Operator as Producer using Active Scan.
In Active scan the job keeps looking for the file in the given directory and starts processing as soon as the file comes.
• ArchiveDirectoryPath
• DirectoryPath
• EnableScan
• FileName
• Format
• OpenMode
• Te x t D e limit e r
• VigilElapsedTime
• VigilStartTime
• VigilSortFile
• VigilWaitTime
• PrivateLogName





DataConnector Operator as Consumer:


Following is the example of Data Connector as Consumer:

DEFINE DATACONNECTOR OPERATOR
DESCRIPTION 'DataConnector'
TYPE DATACONNECTOR CONSUMER --> Note this should be the TYPE to make it a Data Connector Consumer operator
SCHEMA CONSUMER_SOURCE_SCHEMA
ATTRIBUTES
(
VARCHAR AccessModuleName,
VARCHAR AccessModuleInitStr,
VARCHAR FileName,
VARCHAR Format,
VARCHAR OpenMode,
VARCHAR Text Delimiter,
VARCHAR PrivateLogName
);

Depending upon the Way we are using Data Connector Consumer the key attributes differ.

Operator Usage
Key Attributes
Data Connector Operator  as producer without Access Module
• FileList
• FileName
• Format
• OpenMode
• Te x t D e limit e r
• PrivateLogName
Data Connector Operator  as producer with Access Module
• AccessModuleName
• AccessModuleInitStr
• FileName
• Format
• OpenMode
• Te x t D e limit e r
• PrivateLogName


4.2 Teradata PT Performance - Sessions and Instances


4.2 Teradata PT Performance- Sessions and Instances

With in the TPT job scripts we can specify multiple instances of the operator at the point where the operator is referenced in an APPLY statement.

Operator instances then execute in parallel to complete the task.
Each of these instances in turn can login multiple sessions.

Thus multiple instances and multiple sessions each instance provides high degree of parallelism.

The following operators can be configured to enhance job performance through the optimization of instances and sessions:
• DataConnector (instances only)
• Export
• Load
• SQL Inserter (instances only)
• Stream (instances only)
• Update








4.2.1 Specifying Number of Sessions

By default each operator will attempt to simultaneously log on to one Teradata Database session for each AMP configured on the Teradata Database system.
For some jobs, the default parallelism may be excessive. In other cases, there may not be enough available AMPs to provide the sessions necessary to run the job efficiently.

This is the reason why we use the attributes - MaxSessions and MinSessions that support session limits.

  • MaxSessions determines the maximum number of sessions an operator can use.

  • MinSessions, determines the minimum number of sessions that must be available in order for the job to run.
Proper setting of the MinSessions attribute prevents the job from running until there are enough AMPs for it to run at an acceptable rate.




We should consider the following factors while specifying the MaxSessions attribute:

  • If no value is set for MaxSessions, the operator attempts to connect to one session per available AMP.

  • The DDL, ODBC, and SQL Selector operators are limited to a single concurrent session, that is, one session each per single-step job, or one each session per step in a multi-step job.

  • The number of sessions specified by the value of the operator MaxSessions attribute are balanced across the number of operator instances. For example, if the Update operator is defined with two instances, and the MaxSessions attribute is set to 4, each instance of the defined Update operator will run two sessions
If the value of the MaxSessions attribute for an operator is smaller than the number of operator instances, the job will abort.

  • If the value of MaxSessions is set to a number greater than the number of available AMPs, the job runs successfully, but logs on only as many sessions as available AMPs.

  • For some jobs, especially those running on systems with a large number of AMPS, the default session allocation (one per available Teradata Database system AMP) may not be advantageous, and you may need to adjust the MaxSessions attribute value to limit the number of sessions used.

  • The SQL Inserter operator supports only one session.

  • The Stream operator uses an SQL protocol, so it is not seen as a “load job” by the Teradata Database. Therefore, Stream operator connects to as many sessions as requested, up to the number of sessions allowed by the Teradata Database.













4.2.2 Specifying Multiple instances:

We can specify the number of instances for an operator in the APPLY TO or SELECT FROM statement .

Syntax : operator_name [number of instances]
Ex: APPLY <DML>...TO OPERATOR (UPDATE_OPERATOR [2]...)

It is very Important to note that Producer and Consumer Operators use multiple instances differently.

  • Producers automatically balance the load across all instances, pumping data into the data stream as fast as they can

  • By default, consumers will use only as many instances as needed. If one instance can read and process the data in the data stream as quickly as the producers can write it, then the other instances are not used. If the first instance cannot keep up with the producer operators then the second instance is engaged, and so on.

However this behavior of consumer operators can be changed by using the -c option.
This option indicates that the data blocks must be shipped to the target consumer operators in a cyclical, round-robin manner, providing a more even distribution of data to consumer operators.

If the number of instances is not specified, the default is 1 instance per operator.
Teradata PT will start as many instances as specified, but it uses only as many as needed. This might actually lead to wastage of system resources. So we must specify multiple instances  cautiously.
We should read the Teradata PT log file, which displays statistics showing how much data was processed by each instance. Reduce the number of instances if you see under-utilized instances of any operators. If all instances are used add more and see if the job runs better.

Imp Note:  The number of sessions that we specify will be shared by the number of instances. So if we specify 100 session and we have 2 instances then each of them gets 50.
If the number of instances exceeds the number of available sessions, the job aborts.

Therefore, when specifying multiple instances make sure the MaxSessions attribute is set to a high enough value that there is at least one session per instance.

4.1 TPT and Logon security, Encryption and Roles

4.1 TPT and Logon security, Encryption and Roles:

The following security-related attributes may be required for logons to Teradata Database depending on the user authentication method employed.

UserName
The Teradata Database username.
UserPassword
The Teradata Database password associated with the username
TdpId
Identifies the connection to the Teradata Database.  If you don't specify a TdpId, the system will use the default Tdpid, as defined in the Teradata Client clispb.dat
LogonMech
A security mechanism used to externally authenticate the user.
Optional, depending on security setup.
TD 2 is the default mechanism and the system will automatically defer to it unless the default has been set to another mechanism or TD 2 has been disabled.

External authentication system includes Kerberos or Active Directory. External authentication is only available for jobs launched from network-attached clients. It requires special setup.

Do use external authenticationto log on a Teradata PT job script until you understand
the associated setup and logon requirements
LogonMechData
Username, password, and other data required by an external authentication mechanisms to complete the logon.

Values for the security attributes can be assigned in any the following statements, which are  listed in the order they are processed, from lowest to highest priority.
• DEFINE OPERATOR
• in an APPLY statement, or SELECT clause of an APPLY statement


Tip: Specifying the UserName and UserPassword values as job variables avoids problems that may occur if such logon information is kept in plain view in job scripts.

The following operators access non-Teradata data sources. However, since they logon through an access module, they do not require logon information.

• DataConnector
• FastLoad INMOD Adapter
• FastExport OUTMOD Adapter
• MultiLoad INMOD Adapter

For these operators, logon information must be entered as part of the access module or
INMOD/OUTMOD routine through which the operator accesses the outside data source.

The ODBC operator functions differently from other such operators, and allows the option of specifying the following in the job script:
• UserName
• UserPassword








4.1.1. Using Encryption and Side effects:

All Teradata PT operators that interface with the Teradata Database have the option to encrypt
job data during transmission across the network. The data is then decrypted and checked for
integrity when it is received by the Teradata Database.

Its imp to note that Encryption is only available for network-attached clients.

Following operators support encryption:
• DDL
• Export
• Load
• SQL Inserter
• SQL Selector
• Stream
• Update

The Attribute DataEncryption is set on to enable encryption. The default setting is ‘Off.’

There involves a side effect that Encryption may result in a noticeable decrease in load/unload performance due to the time required to encrypt, decrypt, and verify the data.






4.1.2 Using Roles with TPT


Each operator that communicates with the Teradata Database logs on separately, and Teradata PT scripts do not support use of the SET ROLE statement (except for the DDL operator). Since the default role cannot be reset for a Teradata PT session, make sure that Teradata PT user default role includes all the necessary privileges.

Saturday, 21 September 2013

3.3 Initial Setup for Teradata database and Client system ( non- mandatory)

Many times there is some initial set up that needs to be done at both the database end and also the client end.

At the database end we might need to Drop error tables and then set up the target table for a job that loads data into Teradata Database. We can also  drop staging tables from previous steps when employed in multi-step jobs.

This can be done using the  DDL operator.
We need to write a 1st step that uses DDL operator to perform the drop and create table operations.

Similarly at the client side we may need to perform certain operations like copying file to a different directory or unzip it before using it.
Such things can be taken care by the OS command operator. We can define 1st step that uses the OS operator to run a client operating system command to move the files of unzip them.






3.2 Job Variables files (Revisited)

3.2 Job Variables files (revisited):

Job variables can be assigned values in two types of files:

  1. Global variables file: Every PT job automatically reads the global variables file.

The user-defined path and filename of the global job variables file must put inside the  twbcfg.ini file as an entry

GlobalAttributeFile = ‘<userspath>/<usersGlobalJobVariablesName.’

  1. Local job variables file: We can create local job variables file specific to the job.
The local job variables file needs to be specified separately using thee -v option on the tbuild command.

Ex: tbuild -f weekly_update.tbr -v local.jobvars

On MVS local job variables file is specified through the DDNAME of ATTRFILE.


Using job variables requires two things:

  1. Setting up scripts that refer to local or global job variable values by specifying them using the form <attribute name>=@<job variable name>

Ex:

VARCHAR UserName=@Name
VARCHAR UserPassword=@Password
VARCHAR TdpId=@Tdp


  1. Entering variable value assignments in the global job variables file or the local job variables file.

        We can add the variable assignments
  • Separated by commas
  • Or one assignment per line with no comma's

Ex:

 MyTdpId         = 'defaultTdpId'
,MyUserName      = 'defaultUserName'
,MyPassword      = 'defaultPassword'


3.1 Teradata PT configuration file twbcfg.ini:


Before we start running Teradata parallel transporter we should set up certain files and directories as follows:

  1. Global Variables file : As we had discussed earlier this file is used by all the jobs to read the global variables set up. All the jobs  running on the system read this file.
  1. Checkpoint Directory: Directory for checkpoint files.
  1. Log Directory: Directory for log files.

The name of the configuration file where we can set all this is twbcfg.ini

Windows
<installation directory>\twbcfg.ini
UNIX
<installation directory>/twbcfg.ini
z/OS
Specified through the DDNAME of AT TRF I L E.

Values in this file are specified in the following format:

<parameter> = <single-quoted string>;

Following is an example of the twbcfg.ini file on unix:

GlobalAttributeFile = '/usr/tbuild/<version_number>/global.jobvariables.txt';
CheckPointDirectory = '/usr/tbuild/<version_number>/checkpoint';
LogDirectory = '/var/log/tbuild';


To find the tpt installation directory use the 'which tbuild' command. This will give you the name of the bin directory that contains the tbuild executable.

which tbuild
/apps/tpt13.1/teradata/client/13.10/tbuild/bin/tbuild

A level above bin directory is where the twbcfg.ini is located.
(note the directory name is specific to my system. It might be a bit different at your installation)

/apps/tpt13.1/teradata/client/13.10/tbuild> ls
bin          inc          lib64        msg          sample       tptapi       twboper.ini
checkpoint   lib          logs         msg64        template     twbcfg.ini   version

Following are the contents on this file twbcfg.ini on my installation.

more twbcfg.ini
CheckpointDirectory='/apps/tpt13.1/teradata/client/13.10/tbuild/checkpoint'
LogDirectory='/apps/tpt13.1/teradata/client/13.10/tbuild/logs'


We don’t have the global variables file specified.

2.5 Teradata Parallel Transporter Scripting language and Syntax rules


TPT scripting Language and Syntax rules:

TPT uses an SQL-like scripting language to define extract, updating, and load functions in a job script.
All Teradata PT operators use the same language.

  1. Case  sensitivity syntax:

  • Attribute Names (Ex: TDPID) are case insenstive. TDPID and tdpID both mean the same
  • Most of Attribute Values (Ex: 'GT11_loadoper_privatelog' in above example ) are case insensitive , but certain things like filename, directory names can be case sensitive depending upon the platform(Ex: Unix is case Senstive. Files 'myfile.txt' and 'MYFILE.txt' both are different.
  • Non-attribute object parameters, such as the syntax elements in a DEFINE JOB statement, are case-sensitive.

  1. Defining Objects:

  • Objects should be defined  before they are used.

  1. Keywords Restrictions:

  • We should not use TPT keywords like SCHEMA, OPERATOR as the identifiers for column names, attributes and other values.

  1. Keywords INTEGER and VARCHAR are used to declare the attributes of the  operators:
    ex:
                       VARCHAR PrivateLogName    = 'GT11_loadoper_privatelog',
                       INTEGER MaxSessions       =  32,
                       INTEGER MinSessions       =  1,

                      VARCHAR and INTEGER are mandatory required when an attribute if declared but                       attribute value is not specified.

                       However if the attribute declaration includes values then we don’t need to write                        VARCHAR and INTEGER..
                       This means instead of the above we could have just written as follows:

                       PrivateLogName    = 'GT11_loadoper_privatelog',
                       MaxSessions       =  32,
                       INTEGER MinSessions       =  1,

                        This is allowed because we are providing the attribute values along with the                            declarations.

  1. Quotation  Marks

  • Character string literals must be enclosed in single quotes.
  • Values for VARCHAR attributes must be enclosed in single quotes, and embedded quotes must be escaped with two consecutive single quotes.

                       Ex:PrivateLogName    = 'GT11_loadoper_privatelog',

  • Values for INTEGER attributes require no quotes.

                       Ex: INTEGER MinSessions       =  1,

  1. SQL notation:

  • IF SQL statements span multiple lines there should be a space or a tab character between the last character of the line and the 1st character of the next line.
                             If not provided the it joins the two character strings, resulting in either an                                error or the processing of an unintended SQL statement.

                          Ex:

                          'CREATE TYPE INSV_INTEGER AS INTEGER FINAL
                          INSTANCE METHOD IntegerToFloat()

                          In above example  following code would produce an error if no space or tab was added between “FINAL” and “INSTANCE” because the Teradata Database would see the invalid keyword FINALINSTANCE.

  1. Comments


  • TPT supports C- style comments. /* */

2.4 Creating and Using Job variables in Teradata Parallel Transporter



Using and Setting up Job variables:

In the above example we can see many attribute values that start with a @. These are actually job variables.

Ex:
      VARCHAR TdpId             = @MyTdpId,
      VARCHAR UserName          = @MyUserName,
      VARCHAR UserPassword      = @MyPassword,

Actual values are passed at run time using multiple ways explained later.
Once variables have been defined, they can be reused in any job where the value of the variable is valid.
A common use of variables is for the values of operator attributes.

To be able to use job variables we need to complete 2 activities:

  1. Assign values to these variables. There are many places where we can assign values to these variables. Following is the list in the order of priority.


  • Command line (Highest Priority)

You can specify variables on the command line, as follows:

tbuild -f weekly_update.tbr -u "UsrID = 'user', Pwd =  'pass' "




  • A local job variables file

We Can have a local file that contains all the variables.
We can specify a local job variables file, which contains the values for job variables, using the -v option on the command line as follows:

tbuild -f weekly_update.tbr -v local.jobvars

On z/OS, specify a local job variables file through the DDNAME of ATTRFILE.



  • Global job variables file

This global job variables file is read by every Teradata PT job. So if we have any variable that we need all the jobs to use we can place it here.
Place common, system-wide job variables in this file, then specify the path of the global job variables in the Teradata PT configuration file by using the GlobalAttributeFile parameter.
 

  • Job script itself (lowest)




  1. Reference variables in the job script.

Job variable reference is composed of the @ symbol, followed by a unique identifier for the variable.

These variables can be referenced in two places:


  • Specifying Variables for Attributes:

Ex:
VARCHAR UserName = @MyUserName,
VARCHAR UserPassword = @MyPassword

If the attribute is:
  • A character data type, the job variable value must be a quoted string.
  • An integer data type, the job variable value must be an integer.
  • An array attribute, the variable value must be an array of values of the attribute data type


  • Specifying Non-Attribute Job Variables

Job variables can be used for object names and other parameters.

Ex:
'INSERT INTO TABLE ' || @TargTable || ' (:col1, :col2, . . ., :coln);'

Here the name of the target table is passed as variable from either of the 4 ways explained earlier.
However the technique of using them is different. We need to use concatenation as shown above.



2.3 Setting up Job steps in a job and tbuild -s option

2.3 Using and Setting up Job Steps:

Using job steps is optional, but when used, they can execute multiple operations within a single Teradata PT job.

Following are some rules of using job steps:

A job must have at least one step, but jobs with only one step do not need to use the STEP syntax.

Each job step contains an APPLY statement
Some job steps may use a single standalone operator, such as:
• DDL operator, for setup or cleanup operations in the Teradata Database.
• The Update operator, for bulk delete of data from the Teradata Database.
• OS Command operator, for operating system tasks such as file backup.

Job steps are executed in the order in which they appear within the DEFINE JOB statement.

Each job step must complete before the next step can begin.


Ex:
   STEP setup
   (
      APPLY
      ('DROP TABLE SOURCE_EMP_TABLE;'),
      ('DROP TABLE TARGET_EMP_TABLE;'),
      ('DROP TABLE GT11_LOADOPER_ERRTABLE1;'),
      ('DROP TABLE GT11_LOADOPER_ERRTABLE2;'),
      ('DROP TABLE GT11_LOADOPER_LOGTABLE;'),
      ('CREATE TABLE SOURCE_EMP_TABLE(EMP_ID INTEGER,
                                      EMP_NAME CHAR(10));'),
      ('CREATE TABLE TARGET_EMP_TABLE(EMP_ID INTEGER,
                                      EMP_NAME CHAR(10));'),
      ('INSERT INTO SOURCE_EMP_TABLE(1,''JOHN'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(2,''PETER'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(3,''FRANK'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(4,''MARY'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(5,''ELLEN'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(6,''MICHAEL'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(7,''SAM'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(8,''JONATHAN'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(9,''MICHELLE'');'),
      ('INSERT INTO SOURCE_EMP_TABLE(10,''ALICE'');')
      TO OPERATOR (DDL_OPERATOR () );
   );

   STEP load_data
   (
      APPLY
      ('INSERT INTO TARGET_EMP_TABLE (:EMP_ID, :EMP_NAME);')
      TO OPERATOR (LOAD_OPERATOR () [1] )

      SELECT * FROM OPERATOR (EXPORT_OPERATOR() [1] );
   );

Here in the above example we have two steps 'setup' and 'load_data'. Note the keyword STEP is used to define a step and has its own opening and closing brackets with an ending semi colon.
Each Step should have one APPLY clause.
'load_data' step runs only after step 'setup' is complete.


We can start a job from step one or from an intermediate step.
The tbuild -s command option allows you to specify the step from which the job should start, identifying it by either  the step name, as specified in the job STEP syntax, or by the implicit step number, such as 1, 2, 3, and so on. Job execution begins at the specified job step, skipping the job steps that precede  it in the script.


Say we are running a TPT job that contains 5 steps. If the job has run 1st two steps completely and failed in 3rd we can fix the issue and restart the job directly from failed step using the -s option on tbuild command.