COMP9313 week3b Resilient Distributed Dataset (RDD) Part 2: PySpark

Resilient Distributed Dataset (RDD)

https://drive.google.com/drive/folders/13_vsxSIEU9TDg1TCjYEwOidh0x3dU6es

https://www.cse.unsw.edu.au/~cs9313/20T2/slides/L3.pdf

setting

wordCount MapReduce
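As a rough illustration of the word count pattern named above, here is a minimal pure-Python sketch of the three MapReduce phases (map, shuffle, reduce). It does not use Spark; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count`, and the sample input, are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three phases in order and return a word -> count dict."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["a b a", "b c"]  # hypothetical input lines
    print(word_count(sample))
```

In PySpark the same pipeline is usually written on an RDD with `flatMap`, `map` and `reduceByKey`, where `reduceByKey` performs the shuffle and the reduce together.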

 

 

 

Lineage:

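  1) If a failure under this structure loses a partition, the lineage can be inspected with r5.toDebugString()

  2) If a worker fails and data is lost, the data can be recovered from the original dataset by following the lineage structure

  3) If the driver fails, there is a backup driver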
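The recovery idea in point 2 can be sketched with a toy model. This is not Spark's implementation: `ToyRDD` and all of its methods are hypothetical names invented for this example, and only `map`-style (one-to-one) transformations are modelled. Each dataset remembers its parent and the function that produced it, so a lost partition can be recomputed from the original data.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers its parent and how it was built."""

    def __init__(self, name, source=None, parent=None, fn=None):
        self.name = name
        self.source = source  # list of partitions; only set on the root dataset
        self.parent = parent  # the ToyRDD this one was derived from
        self.fn = fn          # per-element function applied to the parent
        self.cache = {}       # partition index -> computed data

    def map(self, name, fn):
        """Derive a new ToyRDD; nothing is computed yet (lazy)."""
        return ToyRDD(name, parent=self, fn=fn)

    def lineage(self):
        """Return the chain of names from this ToyRDD back to the source."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def compute(self, i):
        """Return partition i, recomputing it via the lineage if it is missing."""
        if i not in self.cache:
            if self.parent is None:
                self.cache[i] = list(self.source[i])
            else:
                self.cache[i] = [self.fn(x) for x in self.parent.compute(i)]
        return self.cache[i]

    def lose_partition(self, i):
        """Simulate a worker failure that drops partition i."""
        self.cache.pop(i, None)


if __name__ == "__main__":
    r1 = ToyRDD("r1", source=[[1, 2], [3, 4]])  # hypothetical source data
    r2 = r1.map("r2", lambda x: x * 10)
    r3 = r2.map("r3", lambda x: x + 1)
    print(r3.lineage())
    print(r3.compute(1))
    r3.lose_partition(1)  # simulate losing the partition on r3 and r2
    r2.lose_partition(1)
    print(r3.compute(1))  # rebuilt from r1 by replaying the lineage
```

Only the lost partition is recomputed; the other partitions stay untouched, which is what makes lineage-based recovery cheaper than replicating every intermediate result.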

 

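DAG: To be honest, I could not tell what this part was about. The preparation was far too lazy: a two-hour lecture with only 11 slides, all of them text.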

  DAG and RDD are two core components in Spark

  1) Stage 1: no shuffling (narrow transformations). Stages 2 and 3: shuffling (wide transformations)
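To make the stage boundaries concrete, here is a simplified sketch that cuts a chain of transformation names into stages wherever a shuffle is needed. It is a toy model, not Spark's scheduler: `split_into_stages` and the `WIDE` set are hypothetical names for this example, and the set lists only a few RDD operations that generally require a shuffle.

```python
# Assumption for this sketch: these operations always trigger a shuffle.
WIDE = {"reduceByKey", "groupByKey", "sortByKey", "distinct", "repartition"}


def split_into_stages(transformations):
    """Group a linear chain of transformations into stages.

    Narrow transformations are pipelined into the current stage;
    each wide transformation starts a new stage.
    """
    stages = [[]]
    for op in transformations:
        if op in WIDE:
            stages.append([op])    # a shuffle boundary starts a new stage
        else:
            stages[-1].append(op)  # narrow: stays in the current stage
    return [stage for stage in stages if stage]


if __name__ == "__main__":
    chain = ["flatMap", "map", "reduceByKey", "map", "sortByKey"]
    for number, stage in enumerate(split_into_stages(chain), start=1):
        print(f"stage {number}: {stage}")
```

For the sample chain this gives three stages: the first contains only narrow transformations, and the second and third each begin with a shuffle, matching the layout described above.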

  

Original post: https://www.cnblogs.com/ChevisZhang/p/13152371.html