Hadoop Quick Reference - Part 2


Apart from the fundamentals of Hadoop discussed earlier, what often intrigues a person new to Hadoop is the number of open source projects and components built on top of Hadoop. This article serves as a glossary of those projects and concepts:
  • YARN:
Stands for Yet Another Resource Negotiator. YARN is responsible for managing resource requests from other components/applications; resources here typically refer to CPU and memory. In earlier versions of Hadoop, MapReduce (MR1) was responsible for both resource management and data processing. With YARN, these responsibilities are now split in a more modular, componentized way between MR2 and YARN.


  • Ambari:
Ambari aims to offer a single web-based open source tool for provisioning, managing and monitoring Hadoop clusters. As part of monitoring, it also offers the means to perform retrospective analysis of cluster health and behavior.


  • Avro:
There was a need for applications, written in various programming languages, to be able to serialize and exchange data stored on Hadoop systems. Avro is a serialization system that enables this. Avro stores the data definition in JSON format, making it easy to read and interpret; the data itself is stored in a binary format, making it compact and efficient. Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored using Avro can easily be passed from a program written in one language to a program written in another, even from a compiled language like C to a scripting language like Pig.
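Since the data definition is just JSON, a schema can be written and inspected with nothing but a JSON library. Below is a minimal sketch of an Avro-style record schema in Python; the field names and types are illustrative assumptions, and the real Avro libraries layer binary serialization on top of a schema like this:

```python
import json

# A minimal Avro-style schema expressed as JSON (illustrative field names;
# the actual binary (de)serialization is handled by the Avro libraries).
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id",    "type": "long"},
        {"name": "name",  "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},  # nullable union
    ],
}

# Because the schema is plain JSON, any language can read and interpret it.
schema_json = json.dumps(user_schema)
parsed = json.loads(schema_json)
print(parsed["name"])         # prints "User"
print(len(parsed["fields"]))  # prints 3
```

The human-readable schema travels alongside the compact binary data, which is what makes cross-language exchange straightforward.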
  • Zookeeper:
 ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these services are used in some form or another by distributed applications.

  • Oozie:
 Oozie is a workflow scheduler system for managing Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs whose actions execute Hadoop MapReduce and Pig jobs. A workflow is a collection of actions (e.g. Hadoop MapReduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph). Oozie is implemented as a Java web application that runs in a Java servlet container.
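The control-dependency DAG idea can be sketched in a few lines of Python: each action runs only after everything it depends on has completed. The action names below are hypothetical, and a real Oozie workflow is defined in XML with MapReduce/Pig actions rather than Python functions:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# A workflow as a control-dependency DAG: each action maps to the set of
# actions that must finish before it can start. Action names are made up.
workflow = {
    "clean-data": set(),                      # no dependencies: runs first
    "mr-join":    {"clean-data"},             # waits for clean-data
    "pig-report": {"clean-data"},             # waits for clean-data
    "publish":    {"mr-join", "pig-report"},  # waits for both branches
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # e.g. ['clean-data', 'mr-join', 'pig-report', 'publish']
```

The independent branches (`mr-join`, `pig-report`) could run in parallel, which is exactly what a workflow engine like Oozie exploits.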


  • HBase:
 Apache HBase is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. In essence, HBase is a NoSQL DBMS and does not support joins or subqueries.
  • Cassandra:
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. In essence, Cassandra is a NoSQL DBMS and does not support joins or subqueries.
HBase and Cassandra are competing open source projects and are very similar in many respects. The key differences, though, are:
Cassandra requires that you identify some nodes as seed nodes, which serve as concentration points for inter-cluster communication. HBase, by contrast, requires you to designate some nodes as master nodes, whose job is to monitor and coordinate the actions of the region servers.
Cassandra uses the Gossip protocol for internode communication, and the Gossip service is integrated into the Cassandra software itself. HBase relies on ZooKeeper -- an entirely separate distributed application -- to handle corresponding tasks.
While neither Cassandra nor HBase supports true transactions, both provide some level of consistency control.

  • Hive:
 There was a need to access data on Hadoop using well-established, standard SQL statements. This is a big deal, as organizations can leverage resources with existing SQL skills. Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.

Hive looks very much like a traditional database with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, queries can have very high latency (many minutes); this means Hive is not appropriate for applications that need very fast response times. The second is that Hive is read-based and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.
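To make the "HQL becomes MapReduce" point concrete, here is a single-process Python sketch of how a simple aggregation query might decompose into map, shuffle, and reduce phases. The query and the toy table are made up for illustration; Hive generates and runs real distributed MapReduce jobs:

```python
from collections import defaultdict

# Sketch of how a simple HQL aggregation such as
#   SELECT word, COUNT(*) FROM docs GROUP BY word;
# decomposes into map/shuffle/reduce. Toy single-process version only.
rows = ["hadoop", "hive", "hadoop", "pig", "hive", "hadoop"]  # toy input table

# Map phase: emit (key, 1) pairs.
mapped = [(word, 1) for word in rows]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'hadoop': 3, 'hive': 2, 'pig': 1}
```

The high latency mentioned above comes from each such phase being scheduled as batch work over the full dataset, rather than using indexes for point lookups.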
  • Pig:
 Apache Pig allows you to write complex MapReduce transformations using a simple scripting language, Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a high-level notation, similar to what SQL provides for RDBMS systems. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Pig Latin can be extended with UDFs (User-Defined Functions), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
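As a rough illustration of the dataflow style, the hypothetical Pig Latin statements in the comments below are paired with single-process Python equivalents, including a tiny user-defined function. In real Pig, each relation is distributed and the script compiles to MapReduce jobs:

```python
# Toy input relation; in Pig this would come from a LOAD statement.
records = [("alice", 34), ("bob", 17), ("carol", 52)]

# adults = FILTER records BY age >= 18;
adults = [(name, age) for name, age in records if age >= 18]

# A user-defined function, analogous to a UDF registered with Pig.
def upper_udf(s):
    return s.upper()

# names = FOREACH adults GENERATE upper_udf(name);
names = [upper_udf(name) for name, _ in adults]
print(names)  # ['ALICE', 'CAROL']
```

Each statement names a new relation derived from the previous one, which is what gives Pig Latin its step-by-step, dataflow feel compared to a single declarative SQL query.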
