Determine HDFS datastore specifications

Connect CDC (SQData) HDFS Quickstart


The Hadoop HDFS target may be located on the same system as the SQData Apply Engine or on a different one. The HDFS target datastore is identified by its "url", which consists of the host address and port number where Hadoop is running, together with the fully qualified HDFS file name.
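
For illustration, the parts of a target url break down as follows, assuming a hypothetical host name and file path:
hdfs://hdfs1.example.com:9000/landing/cdc/employee.dat

 host address .... hdfs1.example.com
 port number ..... 9000
 HDFS file name .. /landing/cdc/employee.dat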

Hadoop HDFS also supports file rotation, so that analytics can be performed against a set of HDFS files that are known to be static rather than against files that are continually being updated by replication. The option to rotate the active target file into multiple instances can be based on a time interval, file size, or number of records.

Syntax
DATASTORE hdfs://[<host_name>[:<port_number>]]/<hdfs_file_name>
 OF JSON | AVRO
 AS <alias_name>
 DESCRIBED BY GROUP <group_name>
 STAGING SIZE <n>G
 STAGING DELAY <mmm>
Keyword and Parameter Descriptions
<host_name> | localhost Name of the system running Hadoop HDFS. When Hadoop is running on the same machine as the SQData Apply Engine, the host address can be specified as localhost.
<port_number> TCP/IP port of the target Hadoop HDFS.
<hdfs_file_name> The HDFS file name. SQData supports dynamic specification of the file name using a wildcard, which is particularly useful with long file names. If <hdfs_file_name> contains an "*", the url is dynamic and the "*" will be replaced with the alias name of the target DESCRIPTION. The "*" may optionally be preceded and followed by a string of characters to complete the full file name; see the sketch following these descriptions.
OF JSON | AVRO Format of the target file, either JSON or AVRO.
DESCRIBED BY GROUP <group_name> The DESCRIPTION group that describes the records written to the target.
STAGING SIZE <n>G and/or STAGING DELAY <mmm> Control rotation of the target file, as described below.

HDFS file rotation can be specified to occur once the file has reached a specific size or after a specific number of minutes have elapsed. To specify rotation after 4 gigabytes have been written, specify STAGING SIZE 4G. To specify rotation every hour (60 minutes), specify STAGING DELAY 60. The default, if no STAGING keyword is specified, is one target file with continuous updates. If you specify STAGING, it is recommended that you use both STAGING SIZE and STAGING DELAY in your target datastore definition.
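
For illustration, a minimal sketch of a dynamic file name, assuming a hypothetical host hdfs1.example.com and a target DESCRIPTION aliased EMPLOYEE; per the wildcard rule above, records for that DESCRIPTION would be written to cdc_EMPLOYEE_raw.dat:
DATASTORE hdfs://hdfs1.example.com:9000/cdc_*_raw.dat
 OF JSON
 AS TARGET
 DESCRIBED BY GROUP SOURCE
;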
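
And a minimal sketch of the recommended pairing of both STAGING keywords, here rotating each file instance after at most 4 gigabytes or at most one hour (60 minutes), again with a hypothetical host name:
DATASTORE hdfs://hdfs1.example.com:9000/employee.dat
 OF JSON
 AS TARGET
 DESCRIBED BY GROUP SOURCE
 STAGING SIZE 4G
 STAGING DELAY 60
;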

Example 1

Hadoop is running on the same machine as the SQData Apply Engine and is listening on port 9000. Only one HDFS file, employee.dat, is being created. This setup is common in development/test environments.
DATASTORE hdfs://localhost:9000/employee.dat
 OF JSON
 AS TARGET
 DESCRIBED BY GROUP SOURCE
;

Example 2

Hadoop is running on Host1, a different server from the one running the SQData Apply Engine, and is listening on port 9001. Several different source CDC records are being processed and the HDFS file name is dynamically generated. The output data is written in JSON format. In this example, both STAGING keywords are specified, rotating each file instance after at most 4 gigabytes of data or a delay of at most two hours (120 minutes).
DATASTORE hdfs://Host1:9001/*.dat
 OF JSON
 AS TARGET
 DESCRIBED BY GROUP SOURCE
 STAGING SIZE 4G
 STAGING DELAY 120
;
Note: When SQData applies updates to HDFS target file(s), a time stamp indicating when the file was created is appended to the file name, before the file extension if one exists. For example, a file named employee.dat will show up as employee.2017-05-27-13.18.41.dat.
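
For illustration, with hourly rotation a directory listing might show a set of static, rotated instances plus the most recently created, still-active file (file names and timestamps hypothetical):
employee.2017-05-27-13.18.41.dat
employee.2017-05-27-14.18.42.dat
employee.2017-05-27-15.18.40.dat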