What is Hadoop?

Nitish Kaushik
3 min read · Nov 28, 2022

Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised that uses many machines to process the data (distributed). Such a group of machines is called a cluster, and clusters make up server farms. All of these servers have to be coordinated: the data must be partitioned, computing tasks coordinated, fault tolerance and recovery handled, and processing capacity allocated.

Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It consists of 3 main components:
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data by partitioning data across many machines
• YARN: A framework for job scheduling and cluster resource management (task coordination)
• MapReduce: A YARN-based system for parallel processing of large data sets on multiple machines

HDFS

A cluster has one master node (the name node); the rest are data nodes. The master node manages the overall file system by storing the directory structure and the metadata of the files. The data nodes physically store the data. Large files are broken up into blocks and distributed across multiple machines, and each block is replicated across 3 machines to provide fault tolerance.
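
As a rough illustration of that placement logic, here is a minimal, hypothetical Python sketch: a file is split into fixed-size blocks and each block gets 3 replicas on different data nodes. The node names and the round-robin policy are assumptions for illustration, not HDFS internals.

    BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
    REPLICATION = 3                  # default replication factor
    DATA_NODES = ["node1", "node2", "node3", "node4", "node5"]  # illustrative names

    def place_blocks(file_size_bytes):
        """Return a mapping of block index -> data nodes holding a replica."""
        num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
        placement = {}
        for block in range(num_blocks):
            # round-robin placement; the real name node also considers rack locality
            replicas = [DATA_NODES[(block + i) % len(DATA_NODES)]
                        for i in range(REPLICATION)]
            placement[block] = replicas
        return placement

    # a 300 MB file becomes 3 blocks, each stored on 3 different data nodes
    print(place_blocks(300 * 1024 * 1024))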

MapReduce

MapReduce is a parallel programming paradigm that allows huge amounts of data to be processed by running processes on multiple machines. A MapReduce job is defined in two stages: map and reduce.
• Map: operation to be performed in parallel on small portions of the dataset. The output is a key-value pair <K,V>
• Reduce: operation to combine the results of Map
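
Word count is the classic example. The sketch below simulates the map, shuffle, and reduce phases locally in plain Python just to show the <K,V> flow; a real Hadoop job would distribute these functions across the cluster.

    from collections import defaultdict

    def map_phase(line):
        # Map: emit a <word, 1> pair for every word in a small portion of the data
        return [(word, 1) for word in line.split()]

    def reduce_phase(key, values):
        # Reduce: combine all values emitted for the same key
        return key, sum(values)

    lines = ["big data on hadoop", "hadoop stores big data"]

    # shuffle step: group all intermediate values by key
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)

    result = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(result)  # {'big': 2, 'data': 2, 'on': 1, 'hadoop': 2, 'stores': 1}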

YARN: Yet Another Resource Negotiator

YARN coordinates the tasks running on the cluster and assigns new nodes in case of failure. It consists of 2 subcomponents: the resource manager and the node manager. The resource manager runs on a single master node and schedules tasks across nodes. The node manager runs on all the other nodes and manages tasks on its individual node.
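
The sketch below is a hypothetical, highly simplified Python illustration of that division of labour (class and task names are invented): the resource manager schedules tasks onto healthy node managers and reassigns work when a node fails.

    class NodeManager:
        def __init__(self, name):
            self.name = name
            self.healthy = True
            self.tasks = []

    class ResourceManager:
        def __init__(self, nodes):
            self.nodes = nodes

        def schedule(self, task):
            # pick the healthy node with the fewest running tasks
            candidates = [n for n in self.nodes if n.healthy]
            target = min(candidates, key=lambda n: len(n.tasks))
            target.tasks.append(task)
            return target.name

        def handle_failure(self, failed):
            # mark the node unhealthy and reschedule its tasks elsewhere
            failed.healthy = False
            for task in failed.tasks:
                self.schedule(task)
            failed.tasks = []

    nodes = [NodeManager("nm1"), NodeManager("nm2")]
    rm = ResourceManager(nodes)
    print(rm.schedule("map-task-1"))   # nm1 (the least-loaded healthy node)
    rm.handle_failure(nodes[0])        # its tasks move to the remaining node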

Hadoop Architecture

Hadoop Ecosystem

An entire ecosystem of tools has emerged around Hadoop, which are based on interacting with HDFS.

Hive: Data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs and returns results in the form of tables rather than raw HDFS files.
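
For a feel of the interface, the hypothetical query below shows the HiveQL style; it is submitted through PySpark's Hive support purely to keep the illustration in Python, and the table and column names are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

    # the declarative query is turned into jobs over files in distributed storage
    # and the result comes back as a table
    top_pages = spark.sql("""
        SELECT page, COUNT(*) AS visits
        FROM web_logs          -- assumed table backed by files in HDFS
        GROUP BY page
        ORDER BY visits DESC
        LIMIT 10
    """)
    top_pages.show()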

Pig: High-level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs the ETL that loads the data warehouse, while Hive queries the data warehouse to perform analysis (GCP: DataFlow).
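
The hypothetical Python sketch below mimics the kind of extract-clean-load step a Pig Latin script would express (the record fields are invented): incomplete records are dropped and types are cast before loading into the warehouse.

    raw_records = [
        {"user": "a", "age": "34", "country": "US"},
        {"user": "b", "age": "",   "country": "DE"},   # incomplete record
        {"user": "c", "age": "29", "country": None},   # incomplete record
    ]

    # transform: drop incomplete rows and cast types
    cleaned = [
        {"user": r["user"], "age": int(r["age"]), "country": r["country"]}
        for r in raw_records
        if r["age"] and r["country"]
    ]

    # load: in a real pipeline this would be written to a warehouse table
    print(cleaned)  # [{'user': 'a', 'age': 34, 'country': 'US'}]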

Spark: Framework for writing fast, distributed programs for data processing and analysis. Spark solves similar problems to Hadoop MapReduce but with a fast, in-memory approach. It is a unified engine that supports SQL queries, streaming data, machine learning, and graph processing. Spark can operate separately from Hadoop but integrates well with it. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track their lineage.
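
A minimal PySpark sketch of that RDD model: transformations build an immutable, lazily evaluated lineage, and nothing executes until an action such as collect() is called. The input strings are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(["big data on hadoop", "hadoop stores big data"])

    # each step returns a new RDD; the originals are never modified (immutability)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # only this action triggers the computation (lazy evaluation)
    print(counts.collect())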

HBase: Non-relational, NoSQL, column-oriented database management system that runs on top of HDFS. Well suited for sparse data sets (GCP: BigTable)
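
As a plain-Python illustration of the data model (not the HBase client API), each row key maps to sparse column-family:qualifier cells, so different rows can hold entirely different columns. The row keys and column names below are invented.

    table = {
        "user#1001": {"profile:name": "Asha", "profile:city": "Pune"},
        "user#1002": {"profile:name": "Ben", "activity:last_login": "2022-11-27"},
    }

    # point lookups by row key are the access pattern HBase is built for
    print(table["user#1002"].get("activity:last_login"))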

Flink/Kafka: Stream processing stack. Flink is a stream processing framework; Kafka is the message queue that decouples the stream data from the stream processing. Batch processing is for bounded, finite datasets with periodic updates and delayed processing; stream processing is for unbounded datasets with continuous updates and immediate processing. Streaming data can be grouped into windows using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
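
The sketch below illustrates tumbling windows in plain Python (the timestamps and events are invented): each event lands in exactly one fixed, non-overlapping time bucket. A real Flink job would express this with its own windowing API.

    from collections import defaultdict

    WINDOW_SIZE = 10  # seconds per tumbling window

    events = [(1, "click"), (4, "click"), (12, "view"), (17, "click"), (23, "view")]

    windows = defaultdict(list)
    for timestamp, event in events:
        # each event belongs to exactly one non-overlapping window
        window_start = (timestamp // WINDOW_SIZE) * WINDOW_SIZE
        windows[window_start].append(event)

    print(dict(windows))  # {0: ['click', 'click'], 10: ['view', 'click'], 20: ['view']}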

Beam: Programming model for defining and executing data processing pipelines, including ETL, batch, and stream (continuous) processing. A pipeline is modeled as a directed acyclic graph (DAG); once built, it is executed by one of Beam’s distributed processing backends, such as Apache Apex, Apache Flink, Apache Spark, or Google Cloud Dataflow.
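
Here is a minimal Beam pipeline in the Python SDK: the chained transforms form the DAG, and the chosen runner (the local DirectRunner by default, or Flink, Spark, Dataflow) executes it. The input strings are illustrative.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:   # uses the local DirectRunner by default
        (
            pipeline
            | "Create" >> beam.Create(["big data on hadoop", "hadoop stores big data"])
            | "Split"  >> beam.FlatMap(lambda line: line.split())
            | "Pair"   >> beam.Map(lambda word: (word, 1))
            | "Count"  >> beam.CombinePerKey(sum)
            | "Print"  >> beam.Map(print)
        )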

Oozie: Workflow scheduler system to manage Hadoop jobs

Sqoop: Tool for transferring large amounts of data from relational databases (e.g., MySQL) into HDFS
