HDFS is a distributed, Java-based file system typically used to store large volumes of data. It is one of several subsystems of the Hadoop framework, which provides parallel and distributed computation on large datasets:
- HDFS, a distributed file system that uses a cluster of machines to provide high-throughput access to data for Big Data applications.
- MapReduce, the distributed processing framework that manages and controls processing across the cluster.
Together, these subsystems allow for the distributed processing of large data sets, scaling from single servers to thousands of machines. Each machine provides local computation and storage, which the Hadoop software manages to deliver high availability and high performance without relying on costly hardware-based high availability.
The Hadoop HDFS target may be located on the same system as the SQData Apply Engine or on a different one. The HDFS target datastore is identified by a "url" consisting of the host address and port number where Hadoop is running, along with the fully qualified HDFS file name.
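As an illustration of the three parts of the target "url", the following minimal Python sketch splits an hdfs:// address into host, port, and file name. It uses only the standard library and is not part of SQData; the function name `split_hdfs_url` is hypothetical.

```python
from urllib.parse import urlsplit


def split_hdfs_url(url):
    """Split an hdfs:// URL into (host, port, path).

    Illustrative helper only; SQData parses the DATASTORE url itself.
    """
    parts = urlsplit(url)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an hdfs URL: {url!r}")
    # hostname and port are None when they are omitted from the URL
    return parts.hostname, parts.port, parts.path


host, port, path = split_hdfs_url("hdfs://localhost:9000/employee.dat")
print(host, port, path)  # localhost 9000 /employee.dat
```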
Hadoop HDFS also supports file rotation so that analytics can be performed against a set of HDFS files that are known to be static, as opposed to files that are continually being updated via replication. Rotation of the active target file into multiple instances can be based on a time interval, file size, or number of records.
DATASTORE hdfs://[<hostname>[:<port_number>]] / <hdfs_file_name>
OF JSON | AVRO
AS <alias_name>
DESCRIBED BY GROUP <group_name>
STAGING SIZE <n>G
STAGING DELAY <mmm>
Keyword | Description |
---|---|
<hostname> \| localhost | Name of the system running Hadoop HDFS. When Hadoop is running on the same machine as the SQData Apply Engine, the host address can be specified as localhost. |
<port_number> | TCP/IP port of the target Hadoop HDFS. |
<hdfs_file_name> | The HDFS file name. SQData supports the dynamic specification of the file name using a wildcard. This is particularly useful when using long file names. If the hdfs_file_name contains an "*", then the url is dynamic and the "*" will be replaced with the alias name of the target DESCRIPTION. The "*" may optionally be preceded and followed by a string of characters to complete the full file name. |
OF JSON \| AVRO | Format of the target data, either JSON or AVRO. |
DESCRIBED BY GROUP <group_name> | Name of the DESCRIPTION group. |
STAGING SIZE <n>G and/or STAGING DELAY <mmm> | HDFS file rotation can be specified to occur once the file has reached a specific size or after a specific number of minutes have elapsed. To specify rotation after 4 gigabytes have been written, specify STAGING SIZE 4G. To specify rotation every hour (60 minutes), specify STAGING DELAY 60. The default, if no STAGING keyword is specified, is one target file with continuous updates. If you specify STAGING, it is recommended that you use both STAGING SIZE and STAGING DELAY in your target datastore definition. |
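The wildcard substitution described for the HDFS file name can be sketched in a few lines of Python. This is an illustration of the documented behavior, not SQData code: the function name `expand_target_url` and the alias `EMPLOYEE` are hypothetical, and the sketch assumes the file name contains at most one "*".

```python
def expand_target_url(url_template, description_alias):
    """Replace the "*" in a target url with a DESCRIPTION alias name.

    Assumption: at most one "*" appears in the file name.
    """
    if "*" not in url_template:
        return url_template  # static url, nothing to substitute
    return url_template.replace("*", description_alias, 1)


# "EMPLOYEE" is a made-up DESCRIPTION alias used only for illustration.
print(expand_target_url("hdfs://Host1:9001/*.dat", "EMPLOYEE"))
# hdfs://Host1:9001/EMPLOYEE.dat
print(expand_target_url("hdfs://Host1:9001/hr_*_cdc.dat", "EMPLOYEE"))
# hdfs://Host1:9001/hr_EMPLOYEE_cdc.dat
```

With several DESCRIPTIONs in the group, each alias would yield its own target file name under this reading.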
Example 1
DATASTORE hdfs://localhost:9000/employee.dat
OF JSON
AS TARGET
DESCRIBED BY GROUP SOURCE
;
Example 2
DATASTORE hdfs://Host1:9001/*.dat
OF JSON
AS TARGET
DESCRIBED BY GROUP SOURCE
STAGING SIZE 4G
STAGING DELAY 120
;
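To make the effect of the two STAGING keywords in Example 2 concrete, the Python sketch below models one plausible rotation rule. It rests on two assumptions that the text above does not confirm: that rotation happens as soon as either limit is reached, and that "G" means 2**30 bytes. The function name `should_rotate` is hypothetical, and this is not how SQData is necessarily implemented.

```python
GIB = 1024 ** 3  # assumption: "G" is treated as 2**30 bytes in this sketch


def should_rotate(bytes_written, minutes_open, staging_size_g=None, staging_delay_min=None):
    """Return True when either configured STAGING limit has been reached.

    Assumption: whichever limit is hit first triggers the rotation.
    With no STAGING keyword there is a single, continuously updated file.
    """
    size_hit = staging_size_g is not None and bytes_written >= staging_size_g * GIB
    delay_hit = staging_delay_min is not None and minutes_open >= staging_delay_min
    return size_hit or delay_hit


# Example 2 settings: STAGING SIZE 4G, STAGING DELAY 120
print(should_rotate(1 * GIB, 30, 4, 120))   # False: neither limit reached
print(should_rotate(4 * GIB, 5, 4, 120))    # True: size limit reached
print(should_rotate(1 * GIB, 120, 4, 120))  # True: delay limit reached
print(should_rotate(9 * GIB, 999))          # False: no STAGING specified
```

Under these assumptions, specifying both keywords bounds each rotated file by size on busy targets and by age on quiet ones.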