Spark Streaming + socket wordCount: a small example

Consumer code

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetWorkStream {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf object
    val conf = new SparkConf().setMaster("spark://192.168.177.120:7077").setAppName("netWorkStream")
    // Create the StreamingContext: the main entry point for all data streams.
    // Seconds(1) means one batch is processed every second.
    val ssc = new StreamingContext(conf, Seconds(1))
    // Receive input data from 192.168.99.143
    val lines = ssc.socketTextStream("192.168.99.143", 9999)
    // Count the incoming words
    val words = lines.flatMap { line => line.split(" ") }
    val wordCount = words.map { w => (w, 1) }.reduceByKey(_ + _)
    // Print the result
    wordCount.print()
    ssc.start()            // start the computation
    ssc.awaitTermination() // wait for the computation to terminate
  }
}
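When trying this out without a cluster, the master URL must allow at least two threads, because the socket receiver permanently occupies one of them and the remaining threads process the batches. A minimal local variant of the same job might look like this (a sketch; the object name `LocalNetWorkStream` and the use of `localhost` are assumptions, not part of the original example):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalNetWorkStream {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, at least one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("netWorkStream")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Same wordCount pipeline as above
    val wordCount = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCount.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```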
On another machine, start a netcat server and type some input:
nc -lk 9999
zhang xing sheng zhang
The consumer terminal will display the result:
17/03/25 14:10:33 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 128.0 (TID 134) in 30 ms on 192.168.177.120 (1/1)
17/03/25 14:10:33 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 128.0, whose tasks have all completed, from pool
17/03/25 14:10:33 INFO scheduler.DAGScheduler: ResultStage 128 (print at NetWorkStream.scala:18) finished in 0.031 s
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Job 64 finished: print at NetWorkStream.scala:18, took 0.080836 s
17/03/25 14:10:33 INFO spark.SparkContext: Starting job: print at NetWorkStream.scala:18
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Got job 65 (print at NetWorkStream.scala:18) with 1 output partitions
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Final stage: ResultStage 130 (print at NetWorkStream.scala:18)
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 129)
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Missing parents: List()
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Submitting ResultStage 130 (ShuffledRDD[131] at reduceByKey at NetWorkStream.scala:17), which has no missing parents
17/03/25 14:10:33 INFO memory.MemoryStore: Block broadcast_67 stored as values in memory (estimated size 2.8 KB, free 366.2 MB)
17/03/25 14:10:33 INFO memory.MemoryStore: Block broadcast_67_piece0 stored as bytes in memory (estimated size 1711.0 B, free 366.2 MB)
17/03/25 14:10:33 INFO storage.BlockManagerInfo: Added broadcast_67_piece0 in memory on 192.168.177.120:37341 (size: 1711.0 B, free: 366.3 MB)
17/03/25 14:10:33 INFO spark.SparkContext: Created broadcast 67 from broadcast at DAGScheduler.scala:1012
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 130 (ShuffledRDD[131] at reduceByKey at NetWorkStream.scala:17)
17/03/25 14:10:33 INFO scheduler.TaskSchedulerImpl: Adding task set 130.0 with 1 tasks
17/03/25 14:10:33 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 130.0 (TID 135, 192.168.177.120, partition 1, NODE_LOCAL, 6468 bytes)
17/03/25 14:10:33 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 135 on executor id: 0 hostname: 192.168.177.120.
17/03/25 14:10:33 INFO storage.BlockManagerInfo: Added broadcast_67_piece0 in memory on 192.168.177.120:45262 (size: 1711.0 B, free: 366.3 MB)
17/03/25 14:10:33 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 130.0 (TID 135) in 14 ms on 192.168.177.120 (1/1)
17/03/25 14:10:33 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 130.0, whose tasks have all completed, from pool
17/03/25 14:10:33 INFO scheduler.DAGScheduler: ResultStage 130 (print at NetWorkStream.scala:18) finished in 0.014 s
17/03/25 14:10:33 INFO scheduler.DAGScheduler: Job 65 finished: print at NetWorkStream.scala:18, took 0.022658 s
-------------------------------------------
Time: 1490422233000 ms
-------------------------------------------
(xing,1)
(zhang,2)
(sheng,1)
Notes:
val conf = new SparkConf()
new StreamingContext(conf, Seconds(1)) // create the context
  1. After a context is defined, you have to do the following:
  2. Define the input sources by creating input DStreams.
  3. Define the streaming computations by applying transformation and output operations to DStreams.
  4. Start receiving data and processing it using streamingContext.start().
  5. Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
  6. The processing can be manually stopped using streamingContext.stop().
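The steps above can be sketched as a skeleton program (an illustration only; the host, port, and batch interval are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Skeleton {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("skeleton")
    val ssc = new StreamingContext(conf, Seconds(1))        // define the context
    val lines = ssc.socketTextStream("localhost", 9999)     // 1. define an input DStream
    val counts = lines.flatMap(_.split(" "))                // 2. transformations...
                      .map((_, 1))
                      .reduceByKey(_ + _)
    counts.print()                                          //    ...and an output operation
    ssc.start()                                             // 3. start receiving and processing
    ssc.awaitTermination()                                  // 4. wait for processing to stop
    // ssc.stop()                                           // 5. or stop it manually
  }
}
```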
Points to remember:
  1. Once a context has been started, no new streaming computations can be set up or added to it.
  2. Once a context has been stopped, it cannot be restarted.
  3. Only one StreamingContext can be active in a JVM at the same time.
  4. stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
  5. A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
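The stop/re-use behavior can be sketched as follows (assumed variable names; the second batch interval of Seconds(5) is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReuseSparkContext {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("reuse"))
    val ssc1 = new StreamingContext(sc, Seconds(1))
    // ... set up and run the first streaming job ...
    ssc1.stop(stopSparkContext = false) // stop only the StreamingContext; sc stays alive
    val ssc2 = new StreamingContext(sc, Seconds(5)) // re-use the same SparkContext
    // ... set up and run the second streaming job ...
  }
}
```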
Original post: https://www.cnblogs.com/zhangXingSheng/p/6646913.html