Kafka is a robust, clustered, distributed streaming platform used to build real-time streaming replication pipelines that reliably move data between systems or applications using a publish and subscribe architecture.
Customers choose it because they want to do more than simply replicate captured source data into a, dare we say legacy, relational datastore. Some use the stream of Kafka messages to identify "events" that trigger subsequent downstream business processes. Others use it to populate a "big data" repository where other tools perform analytics or answer questions that may not even be known at the start of a project.
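For example, a downstream application can react to captured changes simply by joining a Kafka consumer group on the topic the pipeline publishes to. The sketch below is illustrative only, assuming JSON-formatted messages and hypothetical broker, topic, and group names:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CdcEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "downstream-event-handlers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe to the topic the CDC pipeline publishes to (name is illustrative).
            consumer.subscribe(List.of("cdc.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each change record becomes an "event" that can trigger a downstream process.
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```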
Apply Engine
Utilizing the Engine's REPLICATE function, the Apply Engine automatically generates JSON schemas and registers AVRO schemas.
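To illustrate what that registration amounts to, the sketch below posts a minimal AVRO schema to Confluent's Schema Registry REST API. The registry URL, subject name, and schema are hypothetical; the Apply Engine performs the equivalent step for you:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterAvroSchema {
    public static void main(String[] args) throws Exception {
        // Hypothetical registry URL and subject; the Apply Engine performs the equivalent call for you.
        String registryUrl = "http://schema-registry:8081";
        String subject = "cdc.orders-value";

        // A minimal AVRO record schema, escaped as a JSON string per the Schema Registry API.
        String body = "{\"schema\": \"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"Order\\\","
                + "\\\"fields\\\":[{\\\"name\\\":\\\"ORDER_ID\\\",\\\"type\\\":\\\"long\\\"}]}\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registryUrl + "/subjects/" + subject + "/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"id":1}
    }
}
```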
Replicator
The Replicator Engine fully automates the propagation of source schema changes and Kafka message production using AVRO and the Confluent Schema Registry. The Replicator also supports parallel processing of the replication workload through multiple Producer threads, with the number of threads or workers specified at run-time. This means that Connect CDC (SQData) becomes a utility function within the enterprise architecture, reacting to relational schema changes without interruption and without maintenance of the Connect CDC (SQData) Kafka producer configuration running in your Linux environment.
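The Replicator manages its Kafka producers internally, but the following minimal sketch shows what AVRO message production against the Confluent Schema Registry looks like. The broker address, topic, and schema are illustrative assumptions:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroCdcProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/looks up schemas in the Schema Registry automatically.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        // Illustrative AVRO schema for a captured change record.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\","
                + "\"fields\":[{\"name\":\"ORDER_ID\",\"type\":\"long\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("ORDER_ID", 1001L);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("cdc.orders", "1001", order));
            producer.flush();
        }
    }
}
```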
Replicator Distributor
When operating as a Parallel Processing Distributor, IMS CDCRAW records stream to Kafka topics partitioned by Root Key for processing by Apply Engines configured as Kafka Consumer groups. Splitting the stream of published data in this way allows it to be consumed by Apply Engines that write (Apply) that data to target datastores of any type.
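Partitioning by Root Key relies on Kafka's standard keyed-partitioning behavior: records with the same key always hash to the same partition, so all changes for a given root segment are processed in order by the same member of the consumer group. The sketch below mirrors that default hashing, with illustrative key values and partition count:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class RootKeyPartitioning {
    // Mirrors Kafka's default partitioner for keyed records: murmur2 hash of the
    // key bytes modulo the partition count, so equal Root Keys always map to the
    // same partition (and therefore the same consumer within a group).
    static int partitionFor(String rootKey, int numPartitions) {
        byte[] keyBytes = rootKey.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 8; // illustrative partition count
        System.out.println(partitionFor("CUST0000123", partitions));
        System.out.println(partitionFor("CUST0000123", partitions)); // same partition every time
        System.out.println(partitionFor("CUST0000456", partitions)); // likely a different partition
    }
}
```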
HDFS (Hadoop)
HDFS is a distributed, Java-based file system typically used to store large volumes of data. It is one of several subsystems that make up the Hadoop framework, which provides for parallel and distributed computation on large datasets:
- HDFS, a distributed file system that utilizes a cluster of machines to provide high-throughput access to data for Big Data applications.
- MapReduce, the distributed processing framework that manages and controls processing across the cluster.
Together, these subsystems allow for the distributed processing of large data sets, scaling from single servers to thousands of machines. Each machine provides local computation and storage managed by the Hadoop software, delivering high availability and high performance without relying on costly hardware-based high-availability.
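As a point of reference, the sketch below writes a small file to HDFS through the standard Hadoop FileSystem API. The NameNode URI and target path are illustrative; HDFS takes care of replicating the file's blocks across the cluster's DataNodes:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // NameNode URI and target path are illustrative; HDFS transparently
        // replicates the blocks of this file across the cluster's DataNodes.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             FSDataOutputStream out = fs.create(new Path("/data/cdc/orders/part-00000.json"))) {
            out.write("{\"ORDER_ID\":1001}\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```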
Apply Engine
HDFS records can be written in a variety of formats, including JSON and AVRO. In addition to automatically generating a JSON schema, the Apply Engine will automatically register the Kafka topic schemas when Confluent's Schema Registry is used to manage them.
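For illustration, the sketch below writes an AVRO container file to HDFS using the standard Avro and Hadoop APIs, with a hypothetical schema and output path; the Apply Engine produces equivalent output for you as part of the Apply process:

```java
import java.net.URI;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroToHdfsExample {
    public static void main(String[] args) throws Exception {
        // Schema, NameNode URI, and output path are illustrative.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\","
                + "\"fields\":[{\"name\":\"ORDER_ID\",\"type\":\"long\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("ORDER_ID", 1001L);

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // The AVRO container file embeds the writer schema in its header.
            writer.create(schema, fs.create(new Path("/data/cdc/orders/part-00000.avro")));
            writer.append(order);
        }
    }
}
```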
Replicator Engine
HDFS records can be written in a variety of formats, including JSON and AVRO. The Replicator automatically generates the JSON schemas; when Confluent's Schema Registry is used to manage them, the Replicator will also register the HDFS record schemas and maintain them automatically as the source tables evolve.
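To show what maintaining schemas as the source evolves involves, the sketch below checks an evolved schema against the latest registered version using the Schema Registry's compatibility API. The registry URL, subject, and schema are hypothetical; the Replicator performs this kind of management automatically:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckSchemaCompatibility {
    public static void main(String[] args) throws Exception {
        // Hypothetical registry URL and subject; the Replicator handles this for you
        // when a new column is added to a source table.
        String registryUrl = "http://schema-registry:8081";
        String subject = "cdc.orders-value";

        // Evolved schema: a new optional field with a default keeps it backward compatible.
        String body = "{\"schema\": \"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"Order\\\","
                + "\\\"fields\\\":[{\\\"name\\\":\\\"ORDER_ID\\\",\\\"type\\\":\\\"long\\\"},"
                + "{\\\"name\\\":\\\"STATUS\\\",\\\"type\\\":[\\\"null\\\",\\\"string\\\"],"
                + "\\\"default\\\":null}]}\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registryUrl + "/compatibility/subjects/" + subject + "/versions/latest"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"is_compatible":true}
    }
}
```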