Monday, November 3, 2014

Hadoop Quick Reference - Part 1

How it all started?
Although businesses and organizations are only now waking up to the realities of managing large amounts of data, this has long been the #1 focus area for web search companies such as Google and Yahoo. Managing billions of indexes and web pages is a daunting task with traditional RDBMS and high-availability technologies. During the early part of the decade (roughly 2002-2006), work that began in the open-source Nutch web-crawler project, and was later backed by Yahoo, evolved into what came to be known as Hadoop. Highly available, reliable hardware was expensive, and its cost was starting to break the economics of storing and managing this kind of data at scale. Hadoop was built on the premise that 'hardware failure is imminent', so the question became how failure should be handled in the higher software layers. This premise has led to the emergence and maturing of technologies around the Hadoop framework.

What does Hadoop offer?
Hadoop, in essence, offers a software framework for the distributed storage of large data sets on clusters of computers (which can be cheap commodity hardware). This architecture makes it well suited to high availability and rapid data processing, achieved via distributed processing across the cluster.

Key Concepts:
Hadoop at its core is defined by:
1) A Data Storage Strategy:
This is called HDFS (Hadoop Distributed File System).
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster, with each file's blocks stored on data nodes. The blocks are replicated on multiple data nodes for reliability. One machine, referred to as the 'Name Node', manages the file system namespace and the mapping of files to the blocks that belong to them. HDFS continuously monitors the replicas in the system and checks for corruption or data node/disk failures. Since the HDFS framework (software) automatically recovers data from failed replicas, failed nodes/hardware need not be replaced immediately (unlike traditional HA technologies, where hardware failure must be addressed right away).
(Image courtesy: IBM)
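The block splitting and replica placement described above can be sketched in a few lines of plain Python. This is a toy model, not the real HDFS implementation: the block size, replication factor, node names, and round-robin placement policy are all illustrative assumptions (HDFS uses rack-aware placement and much larger blocks).

```python
BLOCK_SIZE = 8          # bytes per block here; HDFS defaults to 128 MB
REPLICATION_FACTOR = 3  # copies of each block; matches the HDFS default
DATA_NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break raw data into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=DATA_NODES, replicas=REPLICATION_FACTOR):
    """Round-robin placement: each block is stored on `replicas` distinct
    nodes. This block-to-nodes mapping is what the Name Node tracks."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

blocks = split_into_blocks(b"hello hadoop, this is a small file")
placement = place_replicas(blocks)
for block_id, replica_nodes in placement.items():
    print(f"block {block_id} -> {replica_nodes}")
```

If one node fails, every block it held still has two other copies, so the Name Node can re-replicate from the surviving nodes at its leisure.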

2) A Data Processing Strategy:
This is MapReduce. MapReduce is essentially the 'computation' part that processes the data, and MapReduce jobs run co-located with the data on each of the data nodes. The model primarily involves a 'Map' phase, which applies a particular search query/function across all of the data nodes, and a 'Reduce' phase, which aggregates all of the results from the 'Map' phase.
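The Map and Reduce phases can be sketched with the canonical word-count example. This is pure Python with no Hadoop involved, just an illustration of the model: the map step emits (word, 1) pairs from each input record, and the reduce step aggregates the counts per word.

```python
from collections import defaultdict

def map_phase(records):
    """Map: apply the same function to every record (in Hadoop, the records
    live on the data nodes and this code runs where the data is)."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: aggregate all intermediate values that share the same key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

records = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(map_phase(records))
print(result)
```

Because each map call is independent, the map phase parallelizes naturally across the cluster; only the reduce phase needs to see values grouped by key.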
