HDFS (Hadoop) Datastores - connect_cdc_sqdata - Latest

Connect CDC (SQData) Apply engine

Product type: Software
Portfolio: Integrate
Product family: Connect
Product: Connect > Connect CDC (SQData)
Version: Latest
Language: English
Product name: Connect CDC (SQData)
Title: Connect CDC (SQData) Apply engine
Copyright: 2024
First publish date: 2000
Last edited: 2024-07-30
Last published: 2024-07-30

HDFS is a distributed, Java-based file system typically used to store large volumes of data. It is one of several subsystems that make up the Hadoop framework, which provides for parallel and distributed computation on large datasets:

  • HDFS, a distributed file system that utilizes a cluster of machines to provide high-throughput access to data for Big Data applications.
  • MapReduce, the distributed processing framework that manages and controls processing across the cluster.

Together, these subsystems allow for the distributed processing of large data sets, scaling from single servers to thousands of machines. Each machine provides local computation and storage that the Hadoop software manages to deliver high availability and high performance without relying on costly hardware-based high-availability solutions.

The Hadoop HDFS target may be located on the same system as the SQData Apply engine or on a different one. The HDFS target datastore is identified by a url consisting of the host address and port number where Hadoop is running, along with the fully qualified HDFS file name.
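The pieces of such a url can be illustrated with a short Python sketch; the parsing below is purely illustrative and is not how the Apply engine itself resolves the url.

```python
from urllib.parse import urlparse

# Illustrative only: split an HDFS target url into the components the
# DATASTORE statement refers to (host address, port number, HDFS file name).
url = urlparse("hdfs://localhost:9000/employee.dat")

print(url.hostname)  # localhost
print(url.port)      # 9000
print(url.path)      # /employee.dat
```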

Hadoop HDFS also supports file rotation, so that analytics can be performed against a set of HDFS files that are known to be static rather than files that are continually being updated by replication. This option to rotate the active target file into multiple instances can be based on a time interval, file size, or number of records.

Syntax
DATASTORE hdfs://[<host_name>[:<port_number>]]/<hdfs_file_name>
  OF JSON | AVRO
  AS <alias_name>
  DESCRIBED BY GROUP <group_name>
  STAGING SIZE <n>G
  STAGING DELAY <mmm>
Keyword and Parameter Descriptions
Keyword Description
<host_name> | localhost

Name of the system running Hadoop HDFS. When Hadoop is running on the same machine as the SQData Apply engine, the host address can be specified as localhost.

<port_number> TCP/IP port of the target Hadoop HDFS
<hdfs_file_name>

The HDFS file name. SQData supports dynamic specification of the file name using a wildcard, which is particularly useful with long file names. If the hdfs_file_name contains an "*", the url is dynamic and the "*" is replaced with the alias name of the target DESCRIPTION. The "*" may optionally be preceded and followed by a string of characters to complete the full file name.

OF JSON | AVRO

Kafka "Topics" formatted as either JSON or AVRO

DESCRIBED BY GROUP <group_name> The DESCRIPTION group that describes the data written to the target datastore.
STAGING SIZE <n>G and/or STAGING DELAY <mmm>

HDFS file rotation can be specified to occur once the file has reached a specific size or after a specific number of minutes has elapsed. To specify rotation after 4 gigabytes have been written, specify STAGING SIZE 4G. To specify rotation every hour (60 minutes), specify STAGING DELAY 60. The default, if no STAGING keyword is specified, is one target file with continuous updates. If you specify STAGING, it is recommended that you use both STAGING SIZE and STAGING DELAY in your target datastore definition.
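The rotation rule above can be sketched in Python. This is a hypothetical illustration of the decision logic only (the function name and parameters are invented for this sketch, not part of the product): rotate once the file reaches the STAGING SIZE threshold or the STAGING DELAY interval has elapsed, whichever comes first.

```python
from datetime import datetime, timedelta

def should_rotate(bytes_written: int, opened_at: datetime, now: datetime,
                  staging_size_bytes: int, staging_delay_minutes: int) -> bool:
    """Hypothetical sketch of the rotation rule: rotate once the active
    file reaches STAGING SIZE or STAGING DELAY minutes have elapsed."""
    too_big = bytes_written >= staging_size_bytes
    too_old = now - opened_at >= timedelta(minutes=staging_delay_minutes)
    return too_big or too_old

# STAGING SIZE 4G, STAGING DELAY 120: 5 GB written after only 30 minutes
# still triggers rotation, because the size threshold has been reached.
opened = datetime(2017, 5, 27, 13, 0)
print(should_rotate(5 * 1024**3, opened, opened + timedelta(minutes=30),
                    4 * 1024**3, 120))  # True
```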

Example 1

Hadoop is running on the same machine as the SQData Apply engine and is listening on port 9000. Only one HDFS file, employee.dat, is being created. This setup is common in development and test environments.
DATASTORE hdfs://localhost:9000/employee.dat
  OF JSON
  AS TARGET
  DESCRIBED BY GROUP SOURCE
;

Example 2

Hadoop is running on hdfs.Host1, a different server from the SQData Apply engine, and is listening on port 9001. Several different source CDC records are being processed and the HDFS file name is dynamically generated. The output data is to be written in JSON format. In this example, STAGING SIZE specifies a maximum of 4 gigabytes of data per file instance, and STAGING DELAY a maximum delay of two hours (120 minutes).
DATASTORE hdfs://Host1:9001/*.dat
  OF JSON
  AS TARGET
  DESCRIBED BY GROUP SOURCE
  STAGING SIZE 4G
  STAGING DELAY 120
;
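The dynamic file name substitution used in Example 2 can be sketched as follows; the helper name and the alias value are hypothetical, and only the rule of replacing the "*" with the alias name of the target DESCRIPTION comes from the documentation above.

```python
def resolve_hdfs_name(hdfs_file_name: str, description_alias: str) -> str:
    """Replace the "*" wildcard with the alias name of the target
    DESCRIPTION, leaving any surrounding literal characters intact."""
    return hdfs_file_name.replace("*", description_alias, 1)

# A "*" followed by literal characters, as in Example 2's *.dat:
print(resolve_hdfs_name("*.dat", "EMPLOYEE"))      # EMPLOYEE.dat
print(resolve_hdfs_name("cdc_*.dat", "EMPLOYEE"))  # cdc_EMPLOYEE.dat
```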
Note: When SQData applies updates to HDFS target file(s), a time stamp indicating when the file was created is appended to the file name, before the file extension if one exists. For example, the file name employee.dat will show up as employee.2017-05-27-13.18.41.dat.
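The time-stamping convention in the note can be sketched in Python; the function below is an illustration of the documented naming pattern, not the Apply engine's actual implementation.

```python
from datetime import datetime
import os

def stamped_name(file_name: str, created: datetime) -> str:
    """Illustrative sketch: insert a creation time stamp before the file
    extension (if any), matching the documented employee.dat example."""
    stem, ext = os.path.splitext(file_name)
    stamp = created.strftime("%Y-%m-%d-%H.%M.%S")
    return f"{stem}.{stamp}{ext}" if ext else f"{file_name}.{stamp}"

print(stamped_name("employee.dat", datetime(2017, 5, 27, 13, 18, 41)))
# employee.2017-05-27-13.18.41.dat
```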