Comparing earlier runs showed that one function plays the same role in two jobs, yet its execution times differ widely. So I ran the four jobs separately. Although each uses different data, the data comes from the same generator, its distribution is similar, and the volume is identical.
Below are the running-time tables for these four jobs.
Details for pure RDD job
- Status: SUCCEEDED
- Completed Stages: 7
Completed Stages (7)

| Stage Id | Submitted | Duration | Tasks: Succeeded/Total | Input | Shuffle Read | Shuffle Write |
|---|---|---|---|---|---|---|
| 6 | 2019/01/30 15:58:43 | 94 ms | 41/41 | | 235.4 KB | |
| 5 | 2019/01/30 15:58:42 | 0.4 s | 41/41 | | 382.9 KB | 235.4 KB |
| 4 | 2019/01/30 15:58:42 | 0.1 s | 41/41 | | 99.2 KB | 246.0 KB |
| 2 | 2019/01/30 15:58:41 | 1 s | 41/41 | 765.8 KB | | 99.2 KB |
| 1 | 2019/01/30 15:58:38 | 3 s | 41/41 | 750.1 KB | | |
| 0 | 2019/01/30 15:58:38 | 3 s | 1/1 | 15.7 KB | | |
| 3 | 2019/01/30 15:58:38 | 4 s | 41/41 | | | 137.0 KB |
As the table shows, converting the product info into a pairRDD takes 4 seconds, and the city info and click info take 3 seconds each, whereas in the earlier combined experiment these ran in a few tenths of a second. That suggests some automatic caching was at work there, reusing previous results directly.
These three steps run in parallel, which also shrinks the total. Running time: 5 seconds.
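The pattern the pure-RDD job follows can be sketched as: each dataset is mapped to (key, value) pairs (Spark's `mapToPair`) and then joined on the key, which forces a shuffle. A minimal plain-Python emulation of that pattern (all names — `to_pairs`, `join_pairs`, the sample records — are illustrative, not the author's code):

```python
def to_pairs(records, key_fn):
    """Emulate mapToPair: turn each record into a (key, value) tuple."""
    return [(key_fn(r), r) for r in records]

def join_pairs(left, right):
    """Emulate RDD.join: inner join two pair collections on the key."""
    index = {}
    for k, v in left:
        index.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, rv in right for lv in index.get(k, [])]

products = [{"pid": 1, "name": "p1"}, {"pid": 2, "name": "p2"}]
clicks = [{"pid": 1, "city": 10}, {"pid": 1, "city": 20}, {"pid": 3, "city": 10}]

# Key both datasets by product id, then join: only pid 1 matches, twice.
pairs = join_pairs(to_pairs(products, lambda r: r["pid"]),
                   to_pairs(clicks, lambda r: r["pid"]))
```

In real Spark the `join_pairs` step is what triggers the shuffle stages visible in the table; the three `to_pairs`-style conversions are the independent stages that run in parallel.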
Details for pure RDD job with map join
- Status: SUCCEEDED
- Completed Stages: 3
Completed Stages (3)

| Stage Id | Submitted | Duration | Tasks: Succeeded/Total | Input | Shuffle Read | Shuffle Write |
|---|---|---|---|---|---|---|
| 3 | 2019/01/30 16:00:23 | 0.2 s | 41/41 | | 246.7 KB | |
| 2 | 2019/01/30 16:00:22 | 0.5 s | 41/41 | | 477.6 KB | 246.8 KB |
| 1 | 2019/01/30 16:00:17 | 5 s | 41/41 | | | 478.2 KB |
Presumably because the map join is memory-hungry, the mapToPair stage that carries the city info and the click records runs longer. Running time: 6 seconds.
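The "map join" variant replaces the shuffle join with a map-side (broadcast-style) join: the small table is collected into an in-memory lookup structure that every task holds, so the big click dataset is never shuffled — at the cost of that extra memory, which fits the longer mapToPair stage seen above. A hedged sketch (names and data are illustrative):

```python
# Stand-in for a broadcast variable: the small products table as a dict.
small_table = {1: "p1", 2: "p2"}
clicks = [{"pid": 1, "city": 10}, {"pid": 3, "city": 20}, {"pid": 2, "city": 10}]

def map_join(records, lookup):
    """Join inside the map step: one dict lookup per record, no shuffle."""
    return [(r["pid"], (lookup[r["pid"]], r))
            for r in records if r["pid"] in lookup]

# pid 3 has no product entry, so only two joined records survive.
joined = map_join(clicks, small_table)
```

This is why the job needs only 3 stages instead of 7: the join no longer introduces a shuffle boundary.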
Details for original job
- Status: SUCCEEDED
- Completed Stages: 7
Completed Stages (7)

| Stage Id | Submitted | Duration | Tasks: Succeeded/Total | Input | Shuffle Read | Shuffle Write |
|---|---|---|---|---|---|---|
| 6 | 2019/01/30 16:04:04 | 0.8 s | 200/200 | | 865.5 KB | |
| 5 | 2019/01/30 16:03:58 | 6 s | 200/200 (2 failed) | | 899.9 KB | 869.3 KB |
| 3 | 2019/01/30 16:03:56 | 1 s | 200/200 | | 224.2 KB | 733.2 KB |
| 2 | 2019/01/30 16:03:55 | 2 s | 41/41 | 766.0 KB | | 224.3 KB |
| 4 | 2019/01/30 16:03:50 | 3 s | 41/41 | | | 159.9 KB |
| 1 | 2019/01/30 16:03:49 | 6 s | 41/41 | 750.3 KB | | |
| 0 | 2019/01/30 16:03:49 | 3 s | 1/1 | 15.7 KB | | |
The mapToPair over the click records, the largest dataset, takes the longest: 6 seconds.
Every other operation takes at least as long as its counterpart in the pure-RDD version; in particular, the two operations before collect finish in under a second in the pure-RDD program.
Given the earlier "too many open files" error, the SQL operations presumably create and read/write local files; combined with some SQL statements expressing the business steps less concisely than RDDs, this drags the job down badly. Running time: 16 seconds.
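"Too many open files" is consistent with shuffle behaviour: Spark materialises shuffle data as local files. As a back-of-the-envelope count for the shuffle between a 41-task map stage and a 200-task SQL stage like the ones above (the old hash-based shuffle, without consolidation, opened one file per map task per reduce partition; the sort-based shuffle needs only one data file plus one index file per map task — these are textbook formulas, not measured values):

```python
map_tasks = 41
reduce_partitions = 200  # Spark SQL's default shuffle parallelism

# Hash-based shuffle, no consolidation: one file per (map, reduce) pair.
hash_shuffle_files = map_tasks * reduce_partitions
# Sort-based shuffle: one data file + one index file per map task.
sort_shuffle_files = map_tasks * 2
```

Even in the cheaper sort-based case, several such shuffles running at once can push a low `ulimit -n` over the edge.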
Details for pure sparkSQL job
- Status: SUCCEEDED
- Completed Stages: 7
Completed Stages (7)

| Stage Id | Submitted | Duration | Tasks: Succeeded/Total | Input | Shuffle Read | Shuffle Write |
|---|---|---|---|---|---|---|
| 6 | 2019/01/30 16:08:23 | 0.8 s | 200/200 | | 869.0 KB | |
| 5 | 2019/01/30 16:08:21 | 2 s | 200/200 (1 failed) | | 894.1 KB | 870.2 KB |
| 3 | 2019/01/30 16:08:20 | 1 s | 200/200 | | 224.2 KB | 733.4 KB |
| 2 | 2019/01/30 16:08:18 | 1 s | 200/200 | | 405.2 KB | 224.6 KB |
| 4 | 2019/01/30 16:08:01 | 4 s | 41/41 | | | 159.9 KB |
| 1 | 2019/01/30 16:08:01 | 17 s | 1/1 | 4.0 KB | | |
| 0 | 2019/01/30 16:08:01 | 6 s | 41/41 (1 failed) | | | 401.8 KB |
Spark SQL is slow here to begin with, and turning the first two operations into SQL makes them slower still... Running time: 22 seconds.
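The 200-task stages in the last two tables come from Spark SQL's default shuffle parallelism. For data this small (hundreds of KB), 200 partitions is mostly per-task scheduling overhead, so lowering the setting is a common first tuning step. A hedged config sketch (the value 41 just matches the input parallelism seen above; it is not a tuned recommendation):

```
# spark-defaults.conf (or pass via --conf on spark-submit)
spark.sql.shuffle.partitions  41
```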