The three core components of Hadoop are
- Data Processing Engine : MapReduce & Spark are the two main processing engines for data crunching (a word-count sketch follows this list)
- Resource Management : Yet Another Resource Negotiator (YARN) provides centralized resource management & governance across Hadoop components
- File System : Hadoop’s primary file system is the Hadoop Distributed File System (HDFS), which uses a Master/Slave architecture. It is a logical abstraction layer over the local file systems of the individual cluster nodes
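
As a concrete illustration of the processing layer, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input & output paths come from the command line and are placeholders; this is the canonical introductory example, not part of the platform setup described here.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job is packaged as a jar and submitted with `hadoop jar`, after which YARN allocates the containers that execute the map & reduce tasks.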
The four main Hadoop daemons are
- Name Node : Stores the HDFS metadata (the file system namespace & the block-to-node mapping) & runs on the Master Node
- Data Node : Stores the actual blocks of client data & performs read/write operations. It runs on Worker Nodes (the client sketch after this list shows this division of labour)
- Resource Manager : Responsible for managing resources across the cluster; it runs on the Master Node
- Node Manager : Responsible for managing resources on each individual Worker Node & reporting them to the Resource Manager
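
To make the Name Node / Data Node split concrete, below is a minimal sketch using Hadoop's FileSystem Java API. The NameNode host, port & paths are placeholder assumptions, not values prescribed by the platform.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points at the Name Node; host & port are placeholders
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
    FileSystem fs = FileSystem.get(conf);

    // The Name Node is consulted for metadata (which blocks live on which Data Nodes)...
    Path file = new Path("/user/demo/sample.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      // ...while the bytes themselves are streamed directly to/from Data Nodes
      out.writeUTF("hello hdfs");
    }

    // Listing a directory is a pure metadata operation, served entirely by the Name Node
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + " replication=" + status.getReplication());
    }
    fs.close();
  }
}
```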
The four main Hadoop ecosystem service groups are
- Core Services: Data processing such as MapReduce, HDFS, YARN & common libraries
- Management Service: Cluster management, monitoring & coordination such as Ambari, Oozie & Zookeeper
- Data Access Service: Data access & transformation such as Pig, Hive & Spark
- Data Ingestion Service: Loading of heterogeneous data such as Flume, Sqoop & Kafka (a minimal producer sketch follows this list)
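
As a small taste of the ingestion layer, this sketch publishes one record to a Kafka topic using the Kafka producer Java API. The broker address, topic name & payload are illustrative assumptions only.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IngestSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker-host:9092"); // placeholder broker address
    // Keys & values are sent as plain strings in this sketch
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Each record lands on a topic that downstream consumers (Spark, Flume sinks, ...) can read
      producer.send(new ProducerRecord<>("events", "user-1", "{\"action\":\"login\"}"));
    }
  }
}
```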
The content of this Hadoop implementation is based on the Hortonworks Data Platform (HDP) ecosystem. This is 100% open source software that can be downloaded from the vendor site. In subsequent sessions the following topics will be covered: