Database load script - Connect_ETL - 9.13

Connect ETL Vertica Load Guide

Product type
Software
Portfolio
Integrate
Product family
Connect
Product
Connect > Connect (ETL, Sort, AppMod, Big Data)
Version
9.13
Language
English
Product name
Connect ETL
Title
Connect ETL Vertica Load Guide
Copyright
2023
First publish date
2003
ft:lastEdition
2023-09-11
ft:lastPublication
2023-09-11T19:09:11.694437

Using the environment variables available to the system from the configuration file, the database load script, verticaLoad.sh, does the following:

  • Establishes connectivity between the input data file and the Vertica database.

Upon loading data to a Vertica database, the initiator node is the initial node through which a direct connection and communication is established.

For every instance of the Connect ETL transformation and load job that runs in parallel with the other instances, the invoked database load script, verticaLoad.sh, assigns the initiator node within the array of nodes that constitute the Vertica database. Initially, the database load script determines the number of nodes in the Vertica array of nodes from the information in the configuration file and then assigns the initiator node based on the most appropriate node in the Vertica array at that time.

  • Based on the environment variable values available from the configuration file, the database load script generates and initiates the vsql command statement, which reads data from the standard input connection, communicates directly with the Vertica initiator node, and loads the data to the Vertica initiator node.

VSQL Parameter Options and Load Statements

The database load script automatically generates the vsql commands that contain the following arguments:

  • Parameters
  • Load statements

Parameters

Parameters contain the job and database environment variables, which the configuration file provides to the system. Among the vsql command parameters are the following:

  • Vertica hostname.
  • Username.
  • Password.
  • Database name.
  • Table within the Vertica database to which the data is loaded.
  • Compression type.
  • Reject and exception-handling file name and output directory.

Load Statements

Among the load statements that are available within the database load script for loading data and that are within the scope of this document are the following:

  • COPY
  • COPY LOCAL

COPY

COPY Option with Standard Input

To optimize performance, Vertica recommends using the COPY option within a vsql command.

When you use standard input, STDIN, as an option with the COPY statement within a vsql command, the vsql logic looks to your local system, that is, the local Windows or UNIX/Linux ETL server, not the Vertica initiator node, to retrieve standard input.

Consider the following example of a vsql statement using the COPY option with standard input:
vsql -w '$vertica' -h v025 -c "copy LINEITEM from 
'/data/tpch_package/lineitem.tbl.1' on any node,
'/data/tpch_package/lineitem.tbl.2' on any node,
'/data/tpch_package/lineitem.tbl.3' on any node
delimiter '|' DIRECT;"

COPY Option with Named Pipes

Within a UNIX/Linux environment, when a named pipe is specified with the COPY statement within a vsql command, the vsql logic looks to the Vertica initiator node within the Vertica database, not the local Windows or UNIX/Linux ETL server, to retrieve standard input. The Vertica initiator node distributes the data across other nodes as required. Prior to initiating the vsql command, you would typically move the data to the Vertica initiator node.

Note: Due to a Windows operating system limitation, you cannot use named pipes to load data to a Vertica database within a Windows environment.

Error Log Information

By default, the COPY statement within a vsql command writes the error log information file and the exception and reject files to the database administrator’s home directory in the following directory on the Vertica master node: <dbadmin><dbname>/<dbname>_node0001_catalog/CopyErrorLogs.

Consider the following example that specifies a catalog directory and provides the naming convention for exception files when the COPY option is used:
catalog-dir/CopyErrorLogs/<table-name-input-filename-copy-from-rejected-data>
Note: One exception reason per record is written to the exception file.
Consider the following example that specifies a catalog directory and provides the naming convention for exception files when the COPY option is used:
catalog-dir/CopyErrorLogs/<table-name-input-filename-copy-from-rejected-data>
Note: Each record of the original rejected data is written to a record of the reject file.

COPY LOCAL

COPY LOCAL Option with Standard Input or Named Pipe

You can specify either standard input or named pipes as an option with the COPY LOCAL statement. When you use either standard input or named pipes as an option with the COPY LOCAL statement within a vsql command, the vsql logic looks to the local Windows or UNIX/Linux ETL server, not the Vertica initiator node, to retrieve the input.

Error Log Information

When using the COPY LOCAL statement within a vsql command, the EXCEPTIONS and REJECTED DATA parameters must be specified in order to save the exception and rejected data files to the local Windows or UNIX/Linux ETL server. If the EXCEPTIONS and REJECTED DATA parameters are not specified, exception and rejected data files are not saved to the ETL server when the database load is successful and exceptions and reject data exists.

To save exception and rejected data after a successful load, the following is required within the COPY LOCAL statement:

  • EXCEPTIONS path parameter and REJECTED DATA path parameter.
  • One distinct exceptions filename and one distinct reject data filename.
    Note: The same filename cannot be used for both the EXCEPTIONS path and REJECTED DATA path parameters. The exceptions and rejected data files need not exist when the COPY LOCAL command is invoked. As the COPY LOCAL statement must be able to overwrite them, the exceptions and rejected data files must not be read only.
    Consider the following vsql example that specifies the exceptions and rejected data path and file names:
    \set exceptions ../except_reject/copyLocal.exceptions
    \set rejected ../except_reject/copyLocal.rejected

    Within the COPY LOCAL statement of the vsql command, the variable names for the exceptions and rejected data files are specified:

    COPY large_tbl FROM LOCAL rejected data :rejected exceptions :exceptions; 
    Note: Regardless of the number of input files loaded from the client to the Vertica server, the COPY LOCAL statement writes to only one exceptions file and one rejection file following a successful load.

Configuration File Exceptions and Rejected Data Environment Variables

To accommodate for the exceptions and rejected data requirement when using the COPY LOCAL statement, the Connect ETL Vertica configuration file includes the following environment variables, which must be set as follows:

  • BaseLogDirectory - Directory location where the exception and reject log files are written.
  • COPYLOCAL – Yes.
  • CAPTUREERRORS- Yes.
  • CAPTUREREJECTS – Yes.

To ensure that both the exception and rejected data files are written to the log directory on the ETL server, you must set the specified configuration file environment variables as follows: BaseLogDirectory set to the directory location where the exception, reject, and job log files are written; COPYLOCAL, CAPTUREERRORS, and CAPTUREREJECTS environment variables set to Yes.

Job Log Information

Regardless of whether you use the COPY or COPY LOCAL statements within the vsql command, a Connect ETL job log is generated for each Connect ETL job and the job log file contains the following information:

  • Connect ETL record load statistics.
  • Parameter and load statement values that drive the Connect ETL job and database load tasks.
  • vsql command that loads the data to the Vertica initiator node.

The job log file is written to the log location, which you specify as the BaseLogDirectory environment variable value in the configuration file.