Spark Streaming


The UC Berkeley software stack

As the figure above shows, Storm, Spark Streaming, and Hadoop MapReduce belong to three different software stacks. Spark Streaming runs on the Spark stack, so it can share memory and variables with the other Spark components. Storm, while lower-latency than Spark Streaming, produces results that cannot be consumed directly by other components of its stack for further processing. From this perspective, Spark's slogan of "One Stack to rule them all" is quite compelling.


Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.
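To illustrate this micro-batch programming model, here is a minimal conceptual sketch in plain Python (not the Spark API; `process_batch` and the sample data are invented for illustration) that mimics a streaming word count, the classic map/reduce pipeline applied to each batch:

```python
from collections import Counter

def process_batch(lines):
    """Word count for one micro-batch, analogous to a
    flatMap -> map -> reduceByKey pipeline in Spark Streaming."""
    words = [w for line in lines for w in line.split()]  # flatMap: lines -> words
    return Counter(words)                                # map + reduceByKey

# Each inner list stands for the records received during one batch interval.
batches = [["spark streaming", "spark"], ["streaming is micro batch"]]
results = [process_batch(b) for b in batches]
# results[0]["spark"] -> 2
```

The key point is that the same batch-oriented function is applied repeatedly to each interval's data, which is what lets Spark reuse its batch engine for streaming.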

Spark Streaming

Currently supported data sources include: Kafka, Flume, Twitter, ZeroMQ, MQTT, TCP sockets, Akka Actor, and HDFS.

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Spark Streaming



Spark DStream
    Based on the batch interval, Spark Streaming divides the incoming stream into a series of small batches; a DStream represents this sequence of batches (each one an RDD).
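The batching step described above can be sketched in plain Python (a conceptual simulation, not the Spark API; the 2-second interval and the timestamps are made-up values):

```python
# Simulate how records, each tagged with an arrival time in seconds,
# are grouped into fixed-width batch intervals.
def split_into_batches(records, batch_interval):
    batches = {}
    for t, value in records:
        batch_id = int(t // batch_interval)  # which interval this record falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

records = [(0.5, "a"), (1.9, "b"), (2.1, "c"), (5.0, "d")]
batches = split_into_batches(records, batch_interval=2.0)
# batches -> [["a", "b"], ["c"], ["d"]]
```

Records arriving within the same interval land in the same batch; an interval with no records simply produces no batch here, whereas real Spark Streaming emits an empty RDD for it.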



DStream output operations
    print, foreachRDD, saveAsObjectFiles, saveAsTextFiles, saveAsHadoopFiles
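To show the role these output operations play, here is a hedged Python sketch of a foreachRDD-style sink: each completed batch is handed to a user callback that pushes results to an external system (the `save_to_db` function and the in-memory `store` are hypothetical stand-ins):

```python
def foreach_batch(batches, sink):
    """Apply a side-effecting sink function to every batch,
    analogous in spirit to DStream.foreachRDD."""
    for batch in batches:
        sink(batch)

store = []                      # stands in for an external database

def save_to_db(batch):          # hypothetical sink function
    store.extend(batch)

foreach_batch([[1, 2], [3]], save_to_db)
# store -> [1, 2, 3]
```

Unlike transformations, output operations exist purely for their side effects, which is why they are what actually triggers execution of each batch in Spark Streaming.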




Methods inside MetadataCleaner

















Original article: https://www.cnblogs.com/zDanica/p/5471615.html