HDFS is a distributed, Java-based file system typically used to store large volumes of data. It is one of several subsystems that make up Hadoop, a framework that provides parallel and distributed computation on large datasets:
- HDFS, a distributed file system that uses a cluster of machines to provide high-throughput access to data for Big Data applications.
- MapReduce, the distributed processing framework that manages and controls processing across the cluster.
Together, these subsystems allow large data sets to be processed in a distributed fashion, scaling from a single server to thousands of machines. Each machine provides local computation and storage managed by the Hadoop software, delivering high availability and high performance without relying on costly hardware-based high-availability solutions.
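As a minimal sketch of how an application interacts with HDFS, the following Java snippet uses the standard Hadoop FileSystem client API to write and inspect a file. The NameNode URI, file path, and payload are illustrative assumptions, not part of any particular deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; the host and port here are assumptions.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Create a file in HDFS and write a small payload; the path is illustrative.
        Path path = new Path("/data/example/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello from an HDFS client\n");
        }

        // Confirm the file exists and report its length as seen by the cluster.
        System.out.println("Exists: " + fs.exists(path)
            + ", length: " + fs.getFileStatus(path).getLen());
    }
}
```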
Apply Engine
HDFS records can be written in a variety of formats, including JSON and Avro. In addition to automatically generating a JSON schema, the Apply Engine will automatically register the HDFS record schemas when Confluent's Schema Registry is used to manage them.
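To illustrate what schema registration might look like, the sketch below parses a hypothetical Avro schema for a replicated table and registers it with a Confluent Schema Registry using the standard Java client. The registry URL, subject name, and record/field names are assumptions for the example only, not values produced by the Apply Engine.

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import org.apache.avro.Schema;

public class RegisterSchemaSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Avro schema for a replicated source table; the record and
        // field names are assumptions for illustration only.
        String schemaJson =
            "{ \"type\": \"record\", \"name\": \"Customer\", \"namespace\": \"example.cdc\","
          + "  \"fields\": ["
          + "    { \"name\": \"CUST_ID\",   \"type\": \"long\"   },"
          + "    { \"name\": \"CUST_NAME\", \"type\": \"string\" },"
          + "    { \"name\": \"BALANCE\",   \"type\": \"double\" }"
          + "  ] }";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Register the schema under a subject name. The registry URL and subject are
        // assumptions; the register(...) signature also varies across client versions.
        SchemaRegistryClient registry =
            new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
        int schemaId = registry.register("example.cdc.Customer-value", new AvroSchema(schema));
        System.out.println("Registered schema id: " + schemaId);
    }
}
```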
Replicator Engine
HDFS records can be written in a variety of formats, including JSON and Avro. The Replicator Engine automatically generates the JSON schemas and, when Confluent's Schema Registry is used to manage them, registers the HDFS record schemas and maintains them automatically as the source tables evolve.
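The following sketch shows one way Avro-formatted records could be written directly into HDFS, combining the Avro DataFileWriter with the Hadoop FileSystem API. The schema, NameNode URI, file path, and sample values are illustrative assumptions rather than output of the Replicator Engine itself.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema for a replicated source table; names are illustrative only.
        Schema schema = new Schema.Parser().parse(
            "{ \"type\": \"record\", \"name\": \"Customer\", \"namespace\": \"example.cdc\","
          + "  \"fields\": ["
          + "    { \"name\": \"CUST_ID\",   \"type\": \"long\"   },"
          + "    { \"name\": \"CUST_NAME\", \"type\": \"string\" }"
          + "  ] }");

        // Open an output stream directly in HDFS; the NameNode URI and path are assumptions.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/cdc/customer.avro");

        // Write one Avro record into an Avro container file stored in HDFS.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, fs.create(path, true));

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("CUST_ID", 42L);
            rec.put("CUST_NAME", "ACME Corp");
            writer.append(rec);
        }
    }
}
```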