Hadoop Quick Reference - Part 3

It is recommended that you read Hadoop Quick Ref - 1 and Quick Ref - 2 before this article.

  • Mahout
Mahout is the project name for a set of machine learning libraries that work on data in Hadoop. Essentially, these are AI (Artificial Intelligence) techniques for deriving insights from the data, primarily 'Recommendations', 'Classification' and 'Clustering'.

  • Chukwa
Chukwa is a large-scale 'log collection' framework for monitoring large distributed systems.

  • Flume
Flume is a service for collecting and aggregating large amounts of 'streaming' data into HDFS. It is key in absorbing spikes in the rate of incoming data streams, and it can stream from multiple sources into Hadoop for analysis.
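
To make the idea of absorbing spikes concrete, here is a minimal sketch in plain Python. It is not Flume code and uses no Flume API: it is a toy model of a source feeding a buffer (Flume calls this a channel) that a sink drains at a steady rate. The function name, the tick-based timing and the unbounded buffer are all assumptions made for illustration.

```python
from collections import deque


def simulate_channel(arrivals_per_tick, sink_batch_size):
    """Toy model of a source -> channel -> sink pipeline (hypothetical helper).

    arrivals_per_tick: number of events the source produces on each tick.
    sink_batch_size: maximum number of events the sink writes per tick.
    Returns (delivered_per_tick, backlog_per_tick).
    """
    if sink_batch_size <= 0:
        raise ValueError("sink_batch_size must be positive")

    channel = deque()  # assumption: unbounded in-memory buffer
    delivered_per_tick = []
    backlog_per_tick = []
    next_event_id = 0
    tick = 0

    # Keep ticking until the source is finished and the buffer is drained.
    while tick < len(arrivals_per_tick) or channel:
        arriving = arrivals_per_tick[tick] if tick < len(arrivals_per_tick) else 0
        for _ in range(arriving):
            channel.append(next_event_id)
            next_event_id += 1

        # The sink takes at most one batch per tick, however large the burst.
        batch = [channel.popleft() for _ in range(min(sink_batch_size, len(channel)))]
        delivered_per_tick.append(len(batch))
        backlog_per_tick.append(len(channel))
        tick += 1

    return delivered_per_tick, backlog_per_tick
```

With arrivals of [1, 10, 0, 0, 2] and a batch size of 3, the burst of 10 events is spread over several ticks: the downstream store never receives more than 3 events per tick, and no event is lost.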

  • Spark
Spark is an in-memory cluster (distributed) computing framework, similar in purpose to MapReduce. Spark can keep working data cached in memory across operations instead of writing intermediate results to disk, and hence in several use cases it is several times faster than Hadoop MapReduce. While MapReduce writes to disk and is well suited to large data sets (common with ETL-type operations), Spark performs well with iterative operations that reuse a data set small enough to fit in memory (especially machine learning algorithms). Spark supports Scala, Java and Python.
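
The benefit for iterative workloads can be sketched in plain Python, without Spark itself. The example below is an illustration under stated assumptions, not Spark's implementation: it runs the same simple iterative computation (gradient steps that converge to the mean of a data set) twice, once re-reading the input file on every pass and once loading it into memory a single time. All function names and the learning rate are hypothetical.

```python
import os
import tempfile


def read_dataset(path):
    """Read one float per line from a text file."""
    with open(path) as f:
        return [float(line) for line in f]


def step(estimate, data, learning_rate):
    """One gradient step toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - learning_rate * gradient


def iterate_from_disk(path, steps, learning_rate=0.5):
    """Every pass re-reads the input, as a chain of disk-based jobs would."""
    estimate, reads = 0.0, 0
    for _ in range(steps):
        data = read_dataset(path)
        reads += 1
        estimate = step(estimate, data, learning_rate)
    return estimate, reads


def iterate_in_memory(path, steps, learning_rate=0.5):
    """The input is read once and then reused from memory on every pass."""
    data = read_dataset(path)
    reads = 1
    estimate = 0.0
    for _ in range(steps):
        estimate = step(estimate, data, learning_rate)
    return estimate, reads


def demo(values=(1.0, 2.0, 3.0, 4.0, 5.0), steps=20):
    """Run both variants on a temporary file and return their results."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            for v in values:
                f.write(f"{v}\n")
        return iterate_from_disk(path, steps), iterate_in_memory(path, steps)
    finally:
        os.remove(path)
```

Both variants produce the same estimate, but the first performs one read per iteration while the second performs a single read in total. On a real cluster the saving depends on the data size and the available memory.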

  • Hue
Hue (Hadoop User Experience) provides much sought-after simplicity for common Hadoop operations through a single web user interface. From loading data to viewing and analyzing it, Hue offers a one-stop UI for all the required steps.

