
OVERVIEW
What is Hadoop?
Hadoop is an open-source software framework under the Apache project used for scalable distributed computing. The Hadoop framework allows distributed processing & storage of large datasets across a cluster of nodes. The key advantage of Hadoop over other high-performance computing frameworks is data locality: computation runs on the same nodes where the data resides, rather than pulling data from a centralized store.
The three core components of Hadoop are:
- Data Processing Engine : MapReduce & Spark are the two main processing engines for data crunching (see the word-count sketch after this list)
- Resource Management : Yet Another Resource Negotiator (YARN) provides centralized resource management & governance across Hadoop components
- File System : Hadoop’s primary file system is the Hadoop Distributed File System (HDFS), which uses a Master/Slave architecture. It is a logical abstraction layer over the storage attached to the cluster nodes
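As a concrete illustration of the MapReduce model, below is a minimal word-count sketch written for Hadoop Streaming, which lets any executable that reads stdin and writes stdout act as a mapper or reducer. The script name and the job-submission command in the comments are illustrative only; the exact location of the streaming jar depends on your installation.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- minimal mapper/reducer pair for Hadoop Streaming.
# Run as mapper:  wordcount_streaming.py map
# Run as reducer: wordcount_streaming.py reduce
# A typical (illustrative) job submission:
#   hadoop jar /path/to/hadoop-streaming.jar \
#       -input /data/in -output /data/out \
#       -mapper "wordcount_streaming.py map" \
#       -reducer "wordcount_streaming.py reduce" \
#       -file wordcount_streaming.py
import sys


def mapper():
    # Emit "<word>\t1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive contiguously.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```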

The four main Hadoop daemons are:
- Name Node : Stores the HDFS file-system metadata & runs on the Master Node (see the client-side sketch after this list)
- Data Node : Stores client data & performs read/write operations. It runs on Worker Nodes
- Resource Manager : Responsible for managing resources across the cluster, it runs on the Master Node
- Node Manager : Responsible for managing resources on an individual Worker Node
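To see the Name Node / Data Node split from a client's perspective, the short sketch below drives the standard `hdfs dfs` command-line client from Python. It assumes a configured Hadoop client on the PATH and write access to `/tmp` in HDFS; the paths are examples only. The client contacts the Name Node for metadata operations (creating directories, listings), while the file bytes themselves are streamed to and from Data Nodes.

```python
#!/usr/bin/env python
# hdfs_roundtrip.py -- copy a local file into HDFS and list it back,
# using the standard `hdfs dfs` CLI (assumed to be on the PATH).
import subprocess
import tempfile


def hdfs(*args):
    # Thin wrapper: run an `hdfs dfs` subcommand and return its stdout.
    result = subprocess.run(
        ["hdfs", "dfs", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Write a small local file, then push it into HDFS.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("hello hdfs\n")
        local_path = f.name

    hdfs("-mkdir", "-p", "/tmp/demo")                       # metadata operation (Name Node)
    hdfs("-put", "-f", local_path, "/tmp/demo/hello.txt")   # data streamed to Data Nodes
    print(hdfs("-ls", "/tmp/demo"))                         # listing served from Name Node metadata
```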

Hadoop Ecosystem
- Core Services: Core processing & framework components such as MapReduce, HDFS, YARN & common libraries
- Management Service: Management & monitoring tools such as Ambari, Oozie & ZooKeeper
- Data Access Service: Data access & transformation tools such as Pig, Hive & Spark
- Data Ingestion Service: Loading of heterogeneous data with tools such as Flume, Sqoop & Kafka (see the ingestion sketch after this list)
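As a small taste of the ingestion layer, the sketch below publishes a few JSON records to a Kafka topic using the third-party kafka-python package. The broker address `localhost:9092` and the topic name `events` are assumptions for illustration; Flume or Sqoop would typically be used instead for log or relational-database ingestion, respectively.

```python
#!/usr/bin/env python
# kafka_ingest.py -- publish a few JSON records to a Kafka topic.
# Assumes `pip install kafka-python` and a broker reachable at localhost:9092.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

for i in range(3):
    # Each record lands in the "events" topic, ready for downstream consumers.
    producer.send("events", {"event_id": i, "source": "demo"})

producer.flush()   # block until all buffered records are acknowledged
producer.close()
```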

The content of this Hadoop implementation guide is based on the Hortonworks Data Platform (HDP) ecosystem, which is 100% open-source software that can be downloaded from the vendor's site. In subsequent sessions, the following topics will be covered: