Sunday, March 15, 2015

Hadoop Quick Reference - Part 3

It is recommended that you read Hadoop Quick Reference - Part 1 and Part 2 before going through this article.

  • Mahout
Mahout is the project name for a set of machine learning libraries that work on data in Hadoop. Essentially, these are AI (Artificial Intelligence) techniques for building insights from the data, primarily 'Recommendations', 'Classification' and 'Clustering'.
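To make 'Clustering' concrete, here is a minimal single-machine sketch of k-means in plain Python. It does not use Mahout's API; the function name `kmeans`, the sample points and the starting centroids are all made up for illustration. Mahout's value lies in running this kind of algorithm over data far too large for one machine.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; returns (centroids, assignments)."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: nothing moved in this pass
        centroids = new_centroids
    return centroids, assignments


# Hypothetical sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, [points[0], points[3]])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The first three points end up in one cluster and the last three in the other, which is the kind of grouping a clustering job is meant to surface without being told the groups in advance.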

  • Chukwa
Chukwa is a large-scale 'log collection' framework for monitoring large distributed systems.

  • Flume
Flume is a service for collecting and aggregating large amounts of 'streaming' data into HDFS. It is key in absorbing spikes in the rate of incoming data streams, and it can stream from multiple sources into Hadoop for analysis.
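The way a buffer absorbs spikes can be sketched in a few lines of plain Python. Flume places a channel between each source and sink; the `Channel` class and `simulate` function below are hypothetical stand-ins for that idea, not Flume's actual API, and the arrival numbers are invented.

```python
from collections import deque


class Channel:
    """Bounded in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        # Refuse the event when full so the source knows it was not accepted.
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def take(self, batch_size):
        # The sink drains at its own steady pace, whatever the arrival rate.
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch


def simulate(arrivals_per_tick, capacity, sink_batch_size):
    """Return (events written per tick, number of rejected events)."""
    channel = Channel(capacity)
    written, rejected, seq = [], 0, 0
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):
            if not channel.put(f"event-{seq}"):
                rejected += 1
            seq += 1
        written.append(len(channel.take(sink_batch_size)))
    # Keep draining after the input stops until the buffer is empty.
    while channel.events:
        written.append(len(channel.take(sink_batch_size)))
    return written, rejected


# A spike of 10 events arrives in the second tick; the sink writes at most 3 per tick.
written, rejected = simulate([2, 10, 1, 0, 0], capacity=20, sink_batch_size=3)
print(written, rejected)  # [2, 3, 3, 3, 2] 0
```

The spike is spread over several ticks and nothing is lost, as long as the buffer is large enough to hold the backlog. With a buffer that is too small, events start being rejected, which is why channel capacity matters when sizing a collection pipeline.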

  • Spark
Spark is an in-memory cluster/distributed computing framework (similar to Map-Reduce). Spark keeps working data in memory across the cluster instead of writing intermediate results to disk between steps, and hence is several times faster than Hadoop Map-Reduce in several use cases. While Map-Reduce writes to disk and is well suited to large data sets (common with ETL-type operations), Spark performs well with iterative data operations on smaller data sets that fit in memory (especially machine learning algorithms). Spark supports Scala, Java and Python.
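Why in-memory processing helps iterative work can be shown with a toy model in plain Python. This is not Spark code: `CountingSource` is a hypothetical stand-in for a data set on disk, and the two loops only contrast re-reading the input on every pass with reading it once and reusing it.

```python
class CountingSource:
    """Stands in for a data set on disk and counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def gradient_step(data, estimate, rate=0.5):
    # One iteration: nudge the estimate toward the mean of the data.
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient


def iterate_reloading(source, iterations):
    # Disk-based style: every pass goes back to the source.
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(source.load(), estimate)
    return estimate


def iterate_cached(source, iterations):
    # In-memory style: read once, then reuse the same data on each pass.
    cached = source.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(cached, estimate)
    return estimate


disk_a = CountingSource([2.0, 4.0, 6.0])
disk_b = CountingSource([2.0, 4.0, 6.0])
result_reloading = iterate_reloading(disk_a, 20)
result_cached = iterate_cached(disk_b, 20)
print(disk_a.reads, disk_b.reads)  # 20 1
```

Both loops arrive at the same estimate, but the cached version touches the source once instead of twenty times. Iterative algorithms such as those used in machine learning repeat passes like this many times, which is where the saving adds up.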

  • Hue
Hue provides much sought-after simplicity for performing common Hadoop operations through a single web user interface. From loading data to viewing and analyzing it, Hue offers a one-stop UI for all the required steps.
