Getting Started with Apache Hadoop

Hadoop and Spark are practical platforms for applying the most widely used machine learning algorithms; combining algorithms with software is what builds a core competitive edge.

Introducing Apache Hadoop

  1. Hadoop is an open source software framework from the Apache Foundation
    • Enables distributed processing on large clusters with thousands of nodes and petabytes of data
    • To avoid significant network I/O, Hadoop moves computation to the data rather than moving data to the computation
    • Developers only need to focus on business logic rather than doing the heavy lifting of distributing data, distributing code for parallel processing, and handling failures
  2. Java-based software framework for distributed processing of data
  3. Apache Hadoop releases fall into two generations: the first generation is called Hadoop 1.0 and the second generation is called Hadoop 2.0. The second generation, consisting of the 0.23.x and 2.x release lines, has a completely new architecture and differs fundamentally from Hadoop 1.0.

Evolution of Apache Hadoop

Four parts of Apache Hadoop

  • Hadoop Common: provides the common utilities and libraries that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS): used to store data
  • Yet Another Resource Negotiator (YARN): used to manage the resources of the cluster
  • MapReduce: provides a parallel computing framework for processing data

Structure of Hadoop

Typical characteristics of Apache Hadoop

  • Commodity
  • Robust
  • Scalable
  • Simple
  • Data locality

Hadoop Distributed File System

  • HDFS is a distributed filesystem that provides high scalability and reliability on large clusters of commodity hardware
  • Files in HDFS are divided into large blocks, typically 128 MB in size, that are distributed across the cluster
  • Each block is replicated (typically 3 times) to handle hardware failures, and block placement is exposed by the NameNode so that computation can be moved to the data by the MapReduce framework
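The block size and replication factor can also be set by the client when a file is created. Below is a minimal sketch using the Java FileSystem API, assuming a reachable cluster configured through the usual Hadoop configuration files; the target path and buffer size are illustrative choices, not values from the original text.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the loaded configuration points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/example.txt");  // hypothetical target path
        short replication = 3;                     // typical replication factor
        long blockSize = 128L * 1024 * 1024;       // typical 128 MB block size

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(path, true, 4096, replication, blockSize)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```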

Structure of HDFS

Introducing HDFS

  • HDFS uses a master/slave architecture, consisting of a NameNode and several DataNodes
  • The NameNode acts as the master server, managing the filesystem namespace and the metadata for stored files
  • The DataNode is responsible for handling the read and write requests for the stored data
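To make the NameNode/DataNode split concrete, the sketch below asks the NameNode for the block locations of a file and prints which DataNodes hold each block. The file path is a hypothetical example, and the code assumes the default configuration points at a running cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/example.txt");  // hypothetical file already in HDFS
        FileStatus status = fs.getFileStatus(path);

        // The NameNode answers this metadata query; the data itself stays on the DataNodes.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```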

Features of HDFS

  • Unlimited scalability
  • High availability
  • Data integrity
  • HDFS ACLs (see the sketch after this list)
  • HDFS rebalancing
  • Caching
  • APIs
  • Data Encryption
  • NFS access
  • Write Once and Read Many
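As one illustration of the HDFS ACLs feature listed above, the following sketch grants an additional user read access to a file through the Java FileSystem API. The path and user name are hypothetical, and the call only succeeds if ACLs are enabled on the cluster (dfs.namenode.acls.enabled).

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class HdfsAclExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/example.txt");  // hypothetical existing file

        // Grant an extra user read access on top of the basic owner/group/other permissions.
        AclEntry entry = new AclEntry.Builder()
            .setScope(AclEntryScope.ACCESS)
            .setType(AclEntryType.USER)
            .setName("analyst")              // hypothetical user name
            .setPermission(FsAction.READ)
            .build();

        fs.modifyAclEntries(path, Collections.singletonList(entry));
        fs.close();
    }
}
```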

Storage options on Hadoop

Hadoop can store data regardless of its format, whether structured, semi-structured, or unstructured.

MapReduce Parallel Computing Framework

Introducing MapReduce

  • MapReduce (MR) is a framework to write analytical applications in batch mode on terabytes or petabytes of data stored on HDFS.
  • An MR job usually processes each block (excluding replicas) of input file(s) in HDFS with the mapper tasks in a parallel manner.
  • The MR framework sorts and shuffles the outputs of the mappers to the reduce tasks in order to produce the output.
  • The framework takes care of computing the number of tasks needed, scheduling tasks, monitoring them, and re-executing them if they fail. The developer needs to focus only on writing the business logic.
  • MR distributes tasks to each node in the cluster and processes large data sets in parallel in a highly fault-tolerant manner.
  • The early MapReduce framework consisted of a single JobTracker process running on the master node and a TaskTracker process running on each slave node in the cluster.
  • The JobTracker on the master is responsible for scheduling all the tasks of a job, which are distributed across the different slave nodes.
  • The master node monitors the execution of each task through a heartbeat mechanism and re-executes tasks that have failed.
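The classic WordCount job illustrates the flow described above: each mapper tokenizes one input split and emits (word, 1) pairs, the framework shuffles and sorts them by word, and the reducer sums the counts. This is a minimal sketch of the standard example; input and output paths are supplied as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs once per input split (roughly one HDFS block) and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives all counts for one word after the shuffle/sort phase and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```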