
Overview of Storm
Storm is an open source, distributed, resilient, real-time processing engine. It was started by Nathan Marz in late 2010, while he was working at BackType. On his blog, he described the challenges he faced while building Storm; it is a must read: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html.
Here is the crux of the whole blog: initially, real-time processing was implemented by pushing messages into a queue and then reading them from it with a worker, written in Python or any other language, that processed them one by one. The challenges with this approach are:
- If the processing of a message fails, it has to be put back into the queue for reprocessing
- Keeping the queues and workers (processing units) up and running all the time
Two key ideas of Nathan's make Storm a highly reliable, real-time engine:
- Abstraction: Storm is a distributed abstraction in the form of streams. Streams can be produced and processed in parallel. A spout produces new streams, a bolt is a small unit of processing on a stream, and a topology is the top-level abstraction that wires them together (see the first sketch after this list). The advantage of this abstraction is that nobody needs to worry about what is going on internally, such as serialization/deserialization or sending/receiving messages between different processes; the user can focus on writing the business logic.
- Guaranteed message processing: the second idea is an algorithm Nathan developed, based on random numbers and XORs, that requires only about 20 bytes to track each spout tuple, regardless of how much processing is triggered downstream (see the second sketch after this list).
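The following is a minimal sketch of these abstractions, assuming the Storm 2.x org.apache.storm API; the class names NumberTopology, RandomNumberSpout, and DoubleBolt are illustrative, not part of Storm itself. A spout emits a stream of random numbers, a bolt doubles each one, and a topology wires the two together:

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class NumberTopology {

    // Spout: produces a stream of tuples, here random numbers.
    public static class RandomNumberSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(random.nextInt(100)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("number"));
        }
    }

    // Bolt: a small unit of processing applied to each tuple of the stream.
    public static class DoubleBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            int n = tuple.getIntegerByField("number");
            collector.emit(new Values(n * 2));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("doubled"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Topology: the top-level abstraction wiring spouts and bolts together.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("numbers", new RandomNumberSpout(), 1);
        builder.setBolt("doubler", new DoubleBolt(), 2)
               .shuffleGrouping("numbers");

        // Run locally for a few seconds; Storm itself handles serialization
        // and inter-process messaging between the spout and the bolt.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("demo", new Config(), builder.createTopology());
            Utils.sleep(5000);
        }
    }
}
```

Notice that the spout and bolt only declare their output fields and emit values; everything else, including how tuples move between processes, is handled by Storm.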
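Next is a simplified, self-contained sketch of the XOR bookkeeping behind guaranteed message processing; it is not the real acker code, only an illustration of the arithmetic. Each downstream tuple gets a random 64-bit id that is XORed into a single per-spout-tuple "ack val" once when the tuple is created and once when it is acked. Because XORing the same value twice cancels it out, the ack val returns to zero exactly when the whole tuple tree has been processed:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Illustrative sketch of Storm's XOR-based tracking, not the real acker code.
public class XorAckSketch {
    private static final Random RNG = new Random();

    // One 64-bit "ack val" per spout tuple, regardless of how big the tuple tree grows.
    private final Map<Long, Long> ackVals = new HashMap<>();

    // Called when a spout emits a new root tuple.
    public void rootEmitted(long spoutTupleId) {
        ackVals.put(spoutTupleId, 0L);
    }

    // Called for every downstream tuple: XOR in its id when it is created
    // and XOR it in again when it is acked.
    public void xorInto(long spoutTupleId, long edgeId) {
        ackVals.merge(spoutTupleId, edgeId, (a, b) -> a ^ b);
    }

    // When the ack val is back to zero, every edge has been both created and acked.
    public boolean fullyProcessed(long spoutTupleId) {
        return ackVals.get(spoutTupleId) == 0L;
    }

    public static void main(String[] args) {
        XorAckSketch acker = new XorAckSketch();
        long root = RNG.nextLong();
        acker.rootEmitted(root);

        // A bolt emits two child tuples anchored to the root...
        long childA = RNG.nextLong();
        long childB = RNG.nextLong();
        acker.xorInto(root, childA);   // created
        acker.xorInto(root, childB);   // created

        // ...and later both children are acked.
        acker.xorInto(root, childA);   // acked
        acker.xorInto(root, childB);   // acked

        System.out.println(acker.fullyProcessed(root)); // true
    }
}
```

The fixed per-tuple state is essentially the 64-bit spout tuple id, the 64-bit ack val, and a reference to the originating spout task, which is where the figure of roughly 20 bytes per spout tuple comes from.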
In May 2011, BackType was acquired by Twitter. After becoming popular in public forums, Storm started to be called the "real-time Hadoop". In September 2011, Nathan officially released Storm. In September 2013, he proposed Storm to the Apache Incubator. In September 2014, Storm became a top-level Apache project.