HDFS (Hadoop)

Connect CDC (SQData) Architecture

HDFS is a distributed, Java-based file system typically used to store large volumes of data. It is one of several subsystems that make up Hadoop, a framework that provides for parallel and distributed computation on large datasets:

  • HDFS, a distributed file system that utilizes a cluster of machines to provide high-throughput access to data for Big Data applications.
  • MapReduce, the distributed processing framework that manages and controls processing across the cluster.

Together, these subsystems allow for the distributed processing of large data sets, scaling from a single server to thousands of machines. Each machine provides local computation and storage managed by the Hadoop software, delivering high availability and high performance without relying on costly hardware-based high availability.
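
The engines handle all HDFS I/O themselves; purely as an illustration of what access to HDFS looks like from an application, the sketch below reads one file through the WebHDFS REST gateway. The namenode address, port, user name, and file path are assumptions for the example, not values taken from this document.

  import requests

  # Namenode host, HTTP port (9870 is the usual Hadoop 3.x default), user
  # name, and file path are hypothetical; substitute your own cluster values.
  NAMENODE = "http://namenode.example.com:9870"
  HDFS_PATH = "/data/cdc/orders/part-00000.json"

  # WebHDFS answers the OPEN request with a redirect to the datanode that
  # holds the data; requests follows the redirect automatically.
  response = requests.get(
      f"{NAMENODE}/webhdfs/v1{HDFS_PATH}",
      params={"op": "OPEN", "user.name": "sqdata"},
  )
  response.raise_for_status()
  print(response.text)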

Apply Engine

HDFS records can be written in a variety of formats, including JSON and AVRO. In addition to automatically generating a JSON schema, the Apply Engine will automatically register the HDFS record schemas when Confluent's Schema Registry is used to manage them.
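
The registration itself is performed automatically by the Apply Engine; the sketch below only illustrates the kind of call Confluent's Schema Registry accepts over its REST API. The registry URL, subject name, and record schema are hypothetical.

  import json
  import requests

  # Hypothetical registry endpoint and subject; the Apply Engine performs
  # this registration itself, so this is illustration only.
  REGISTRY = "http://schema-registry.example.com:8081"
  SUBJECT = "hdfs_orders-value"

  record_schema = {
      "type": "record",
      "name": "Order",
      "fields": [
          {"name": "order_id", "type": "long"},
          {"name": "status", "type": "string"},
      ],
  }

  # Register a new schema version under the subject.
  resp = requests.post(
      f"{REGISTRY}/subjects/{SUBJECT}/versions",
      headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
      data=json.dumps({"schema": json.dumps(record_schema), "schemaType": "AVRO"}),
  )
  resp.raise_for_status()
  print("Registered schema id:", resp.json()["id"])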

Replicator Engine

HDFS records can be written in a variety of formats, including JSON and AVRO. The Replicator Engine automatically generates the JSON schemas; when Confluent's Schema Registry is used for managing the schemas, the Replicator Engine will automatically register the HDFS record schema and maintain those schemas as the source tables evolve.
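
Because the registered schemas are kept current as the source tables evolve, a downstream reader can always ask the registry for the latest version of a subject. The sketch below shows such a lookup; the registry URL and subject name are again assumptions.

  import requests

  # Hypothetical registry endpoint and subject name.
  REGISTRY = "http://schema-registry.example.com:8081"
  SUBJECT = "hdfs_orders-value"

  # Fetch the most recently registered schema version for the subject.
  latest = requests.get(f"{REGISTRY}/subjects/{SUBJECT}/versions/latest")
  latest.raise_for_status()
  info = latest.json()
  print("version:", info["version"], "schema id:", info["id"])
  print("schema:", info["schema"])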