Mahout is the project name for a set of machine learning libraries for data in Hadoop. Essentially these are AI (Artificial Intelligence) techniques to build insights from the data, primarily 'Recommendations', 'Classification' and 'Clustering'.
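To make the 'Recommendations' idea concrete, here is a toy item-based collaborative-filtering sketch in plain Python. It only illustrates the underlying technique (similarity-weighted ratings), not Mahout's actual API; the user names and ratings are made up for the example.

```python
# Toy collaborative filtering: recommend items a user has not rated,
# weighted by how similar other users are to that user.
from math import sqrt

# Hypothetical user -> {item: rating} data (illustrative only).
ratings = {
    "alice": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob":   {"book_a": 4, "book_b": 3, "book_d": 5},
    "carol": {"book_b": 2, "book_c": 5, "book_d": 4},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Score items the user has not seen by similarity-weighted ratings."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # items alice hasn't rated, best candidates first
```

Mahout implements the same family of ideas at Hadoop scale, distributing the similarity computation across the cluster.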
This is a large-scale 'log collection' framework used when monitoring large distributed systems.
Flume is a service for collecting and aggregating large amounts of 'streaming' data into HDFS. It is key in absorbing spikes in the rate of incoming data streams and can stream data from multiple sources into Hadoop for analysis.
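Flume agents are wired together declaratively through a source-channel-sink pipeline. A minimal sketch of an agent configuration, tailing an application log into HDFS (the agent name, file paths and capacities below are illustrative assumptions, not from any real deployment):

```properties
# One agent ("agent1") with a single source, channel and sink.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: tail a hypothetical application log file.
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer that absorbs spikes in the incoming rate.
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write events into HDFS for later analysis.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1
```

The channel is what decouples the producers from HDFS: if the log rate spikes, events queue in the channel rather than being dropped.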
Spark is an in-memory cluster/distributed computing framework (similar to MapReduce). Spark utilizes the OS buffer cache for data processing and is hence, in several use cases, several times faster than Hadoop MapReduce. While MapReduce writes to disk and is well suited for large data sets (common with ETL-type operations), Spark performs well with iterative operations on smaller data sets (especially machine learning algorithms). Spark supports Scala, Java and Python.
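To illustrate the map-shuffle-reduce model that both frameworks expose, here is a toy word count in plain Python. This is only a sketch of the programming model; real Spark code would use the RDD or DataFrame APIs (and in MapReduce each phase would run distributed across the cluster, with the shuffle spilling to disk).

```python
# Toy word count expressed as the three MapReduce phases.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "hadoop is reliable"]
print(reduce_phase(shuffle(map_phase(lines))))
```

Spark's advantage for iterative workloads is that intermediate results like these groups can be cached in memory across iterations, instead of being rewritten to disk between every map and reduce step.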
Apache Hue provides much-sought-after simplicity for performing common Hadoop operations through a common web user interface. From loading data to viewing and analyzing it, Hue offers a one-stop UI for all the required steps.
In this blog I try to outline some of my learnings from building a
managed cloud service ‘cloud-first’, and to reflect on and contrast them against
the traditional on-premise development model. The views in this article are my
own, so feedback and comments are welcome.
On-Premise/Traditional Product Development

Traditional on-prem software development often revolved around the core
premises of Develop-Test-Deliver. Serviceability
(read as traceability, logging), although critical in any enterprise-class
product, is often refined reactively based on the nature of issues encountered
in the field. With the ability to ‘later’ release new versions of the product,
it is considered acceptable to have this improved upon in future
increments/versions. Automated validation and continuous
integration, although critical, are often looked upon as ‘process
optimization’ and ‘good to have’ for improving product quality and
development efficiency. Since on-premise software release cycles typically
span several wee…
IBM Watson is a cognitive computing system with natural-language-based analytics capabilities. Watson first shot to fame when it defeated the reigning champions in a game of Jeopardy! on live television. Jeopardy! is a quiz show in which contestants are presented with general-knowledge clues in the form
of answers, and must phrase their responses in question form.
The game showcased how Watson is able not only to handle a breadth of complex questions but also to understand and respond to metaphors and puns in them.
What makes Watson different?
1) Can understand natural language: can decipher data from natural-language text sources including wikis, web pages, tweets...
2) Can learn from past results and improve: gets better over time and is able to learn from past mistakes.
3) Is domain agnostic: Not built for any specific area of business/domain.
It reminded me of the DIKW pyramid and the definition of 'Wisdom' in the context of computer systems: