Apache Hadoop is an open-source software framework that supports distributed storage and processing of large amounts of data on scalable clusters of commodity hardware. It consists of two primary components:
- HDFS – the Hadoop Distributed File System, which splits data files into large blocks (128 MB by default) and replicates each block across multiple nodes in the cluster (see the block-location sketch after this list)
- MapReduce – a programming model that divides data processing among many nodes in the cluster and then combines the partial results, gaining throughput through parallelization. A MapReduce job is built from two programmer-written functions, map and reduce (a word-count example follows this section).
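To make the storage side concrete, the sketch below uses the HDFS Java client (`org.apache.hadoop.fs.FileSystem`) to list the blocks of a file and the datanodes holding each replica. It is a minimal sketch: the file path is supplied on the command line, and the code assumes the cluster's configuration files (core-site.xml, hdfs-site.xml) are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);        // connects to the default filesystem (HDFS)
        Path file = new Path(args[0]);               // path of a file already stored in HDFS (placeholder)

        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation describes one block of the file and the datanodes holding a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Run against a real cluster, the output makes the block layout visible: a large file appears as a sequence of fixed-size blocks, each reported with the hosts that store a copy of it.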
The Hadoop framework handles data distribution, job scheduling and execution, and recovery from job and node failures, so programmers writing MapReduce jobs need only concern themselves with the actual processing of the data.
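The classic word-count job from the Apache Hadoop MapReduce tutorial shows both halves of the model: the map function emits a (word, 1) pair for every token in its input, and the reduce function sums the counts for each word. Between the two phases the framework shuffles and groups the intermediate pairs by key across the cluster, which is exactly the distribution and failure handling described above. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: for every input line, emit a (word, 1) pair per token
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce: sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each map node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that nothing in the job mentions which nodes run the map or reduce tasks, where the input blocks live, or what happens if a task fails; those concerns belong entirely to the framework.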