[Notes] How an RDD Runs - 1

While debugging, I noticed an interesting phenomenon. When a task runs, it works through its partition's data row by row, in batches, and each pass invokes the entire chain of operations we applied to the RDD. Partitions execute in parallel with each other, while the rows within one batch are of course processed sequentially (they cannot be computed in parallel).
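
You can see the same pull-based pipelining with plain Scala iterators, which is essentially how chained narrow transformations behave inside one task (a sketch, not Spark code):

val rows = Iterator("row1", "row2")                // stand-in for one partition
val pipelined = rows
  .map { r => println("map1: " + r); r }
  .map { r => println("map2: " + r); r }
// nothing has run yet: Iterator.map is lazy, like an RDD transformation
pipelined.foreach(r => println("done: " + r))
// prints map1/map2/done for row1 before map1 for row2 even starts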

For example, suppose I have a job:

rdd.map(...).map(...).map(...).reduce(...)

The data source has 5 rows, split across 2 partitions.
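
(To reproduce the two-partition layout, textFile takes a minimum-partition hint; the exact row boundaries are still chosen by the input splits:)

val rdd = sc.textFile("PATH_TO_TEST_DATA/sale.svc", minPartitions = 2)
println(rdd.getNumPartitions)  // typically 2 for a file this small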

  1. task-0 - partition 0:

     row 1, map(...).map(...).map(...)
     row 2, map(...).map(...).map(...)
     reduce(...)
     continues through the rest of partition 0's rows...

  2. task-1 - partition 1:

     row 4, map(...).map(...).map(...)
     row 5, map(...).map(...).map(...)
     reduce(...)
     continues through the rest of partition 1's rows...

  3. task-2 - the DAG scheduler:

     reduces the results of task-0 and task-1 (see the per-task sketch after this list)
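
Conceptually, each task pulls its partition through the map chain and folds the mapped rows with reduceLeft, producing one partial result per partition. A minimal sketch of that per-task computation, using the mapped (user, price) rows from this example (plain Scala, not Spark's internal task classes):

// what one reduce task computes, conceptually: fold this partition's
// already-mapped rows into a single partial result
def runReduceTask(
    partition: Iterator[(String, Double)],
    f: ((String, Double), (String, Double)) => (String, Double)): (String, Double) =
  partition.reduceLeft(f)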

task-0 and task-1 belong to the same stage, so they can run in parallel;

"task-2" sits outside that stage, but it can still overlap with task-0 and task-1;

as task-0 and task-1 run, each sends its partial result back the moment it finishes, so the final reduce can begin before the whole stage completes (note that for reduce this is a result returned to the driver, not a shuffle).

At first I didn't know whether the TaskScheduler implements this level of parallelism.
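
For reduce, at least, the incremental merge does happen, just on the driver rather than in a task: reduce hands SparkContext.runJob a result handler, and the DAGScheduler invokes it on its event-loop thread each time a task finishes, which is why dag-scheduler-event-loop shows up in the log below. A minimal sketch of that merge pattern, with illustrative names rather than Spark's real internals:

object PartialMerge {
  // in Spark this role is played by the result handler that reduce passes to
  // SparkContext.runJob; the names here are illustrative only
  var jobResult: Option[(String, Double)] = None

  def taskSucceeded(partial: (String, Double)): Unit = synchronized {
    jobResult = jobResult match {
      case None      => Some(partial)                       // first partial result
      case Some(acc) => Some((acc._1, acc._2 + partial._2)) // fold in the next one
    }
  }
}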

Test data: sale.svc

John,iphone Cover,9.99
John,Headphones,5.49
Jack,iphone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49

Test code:

import org.apache.spark.{SparkConf, SparkContext}

object SaleStat {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("SaleStat")
      .setMaster("local[4]")
      .set("spark.executor.heartbeatInterval", 100.toString)
    val sc = new SparkContext(conf)
    val saleData = sc.textFile("PATH_TO_TEST_DATA/sale.svc")
      .map { line =>
        val words = line.split(",")
        println("map1: " + Thread.currentThread().getName + "," + line)
        words
      }
      .map { words =>
        val (user, product, price) = (words(0), words(1), words(2))
        println("map2: " + (Thread.currentThread().getName, user, product, price))
        (user, product, price)
      }
      .map { x =>
        val price: Double = x._3.toDouble
        println("map3: " + Thread.currentThread().getName + "," + x)
        (x._1, price)
      }
      .reduce { (x, y) =>
        // note: this is reduce, not reduceByKey; the log tag below is just a label
        val total: Double = x._2 + y._2
        println("reduceByKey: " + (Thread.currentThread().getName, x, y, total))
        (x._1, total)
      }

    // reduce is an action, so the job already ran above; this prints the merged result
    println("saleData: " + saleData)
    sc.stop()
  }
}

Output:

map1: Executor task launch worker-0,John,iphone Cover,9.99
map1: Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95
map2: (Executor task launch worker-0,John,iphone Cover,9.99)
map2: (Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-1,(Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-0,(John,iphone Cover,9.99)
map1: Executor task launch worker-1,Bob,iPad Cover,5.49
map1: Executor task launch worker-0,John,Headphones,5.49
map2: (Executor task launch worker-1,Bob,iPad Cover,5.49)
map2: (Executor task launch worker-0,John,Headphones,5.49)
map3: Executor task launch worker-1,(Bob,iPad Cover,5.49)
map3: Executor task launch worker-0,(John,Headphones,5.49)
reduceByKey: (Executor task launch worker-1,(Jill,8.95),(Bob,5.49),14.44)
reduceByKey: (Executor task launch worker-0,(John,9.99),(John,5.49),15.48)
map1: Executor task launch worker-0,Jack,iphone Cover,9.99
map2: (Executor task launch worker-0,Jack,iphone Cover,9.99)
map3: Executor task launch worker-0,(Jack,iphone Cover,9.99)
reduceByKey: (Executor task launch worker-0,(John,15.48),(Jack,9.99),25.47)
reduceByKey: (dag-scheduler-event-loop,(Jill,14.44),(John,25.47),39.91)
saleData: (Jill,39.91)

worker-0 is task-0
worker-1 is task-1
dag-scheduler-event-loop is "task-2" (not really a task, but the driver's DAGScheduler event-loop thread, merging each partial result as it arrives)
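
To confirm which rows landed in which partition, mapPartitionsWithIndex makes the split explicit (a quick check assuming the same sc and file as above):

sc.textFile("PATH_TO_TEST_DATA/sale.svc")
  .mapPartitionsWithIndex { (idx, it) => it.map(line => s"partition $idx: $line") }
  .collect()
  .foreach(println)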

This is probably part of the DAG's appeal. RDD execution is lazy: transformations only record what to do, and the DAG scheduler pipelines a stage's narrow transformations into a single task, so each row flows through the whole chain back to back in one thread, avoiding intermediate disk and network I/O.
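
The laziness is easy to observe on its own: a transformation alone prints nothing, and the side effects only fire once an action runs (count here is just a convenient action; same sc and file assumed):

val mapped = sc.textFile("PATH_TO_TEST_DATA/sale.svc")
  .map { line => println("map ran: " + line); line }  // transformation: recorded, not run
println("nothing has printed yet")
mapped.count()  // action: only now does "map ran" appear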

Original post: https://www.cnblogs.com/ivanny/p/spark_rdd_run_DAGScheduler.html