Spark笔记03

  • 进一步介绍了Transformation操作
  • 进一步介绍了Action操作
  • 知识点解析:Function
  • 知识点解析:Suffle

Transformation

map

map(func) converts each element of the source RDD into a single element of the result RDD by applying a function.

也就是说,rdd里的一个元素,经过转换,生成另一个元素。这里的转换,可以是一个String对应到一个Int,如下:

val textFile = sc.textFile("data.txt")
val lineLengths = textFile.map(s => s.length)

也可以是一个Int,对应到一个(Int, Int),如下:

val a = sc.parallelize(1 to 9, 3)
def mapDoubleFunc(a : Int) : (Int,Int) = {
    (a,a*2)
}
val mapResult = a.map(mapDoubleFunc)

最终,输出是N维的,和输入相同。

flatMap

flatMap(func) works on a single element (as map) and produces multiple elements of the result (as mapPartitions).

flatMap transforms an RDD of length N into another RDD of length M.

举例如下,输入是一串String,输出可能是任意个String。维度可能发生变化。

// count each word's apperance in a file
val textFile = sc.textFile("data.txt")
val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts.collect()

https://data-flair.training/blogs/apache-spark-map-vs-flatmap/

mapPartitions

mapPartitions(func) converts each partition of the source RDD into multiple elements of the result (possibly none).

val a = sc.parallelize(1 to 9, 3)
  def doubleFunc(iter: Iterator[Int]) : Iterator[(Int,Int)] = {
    var res = List[(Int,Int)]()
    while (iter.hasNext)
    {
      val cur = iter.next;
      res .::= (cur,cur*2)
    }
    res.iterator
  }
val result = a.mapPartitions(doubleFunc)
result.collect()

mapPartitions v.s. map

  • 操作对象不同:前者针对partition操作,后者针对每一个element操作,一般来说前者的loop更少,更加快。
  • 传参不同:前者需要一个func,且type为Iterator<T> => Iterator<U>,后者只需要一个func。
  • 结果维度不同:前者的维度可能是任意值,后者的维度一定是N,和传入参数维度相同。

https://blog.csdn.net/lsshlsw/article/details/48627737

aggretation

rdd1.union(rdd2):combine two rdds

union

rdd1.intersection(rdd2): pick the intersection only, like inner join

intersection

rdd1.subtract(rdd2): pick the one left has only, like left join

subtract

rdd1.cartesian(rdd2): double for loop

cartesian

rdd1.zip(rdd2): two rdds must have same structure, match one with another

zip

Action

spark_action

Understanding Function / Lambda

Spark编程中,使用了大量的函数式编程,一般可以分为两种:

  1. Anonymous function syntax (Lambda)
val lineLengths = lines.map(s => s.length)

这样写,有两个好处:

  • 一是简洁。
  • 另外,保证都是局部变量而不是全局变量,可以减小开销,且并行计算不容易出错。
  1. Static methods in a global singleton object
object MyFunctions {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunctions.func1)

Understanding Suffle

Spark的运算是先transformation/map,然后action/reduce。对于action,一般来说是把每个node的结果拿来做aggregation,比如求和。但是有的操作,需要遍历所有node的所有key,比如排序。这时,就需要用到suffle

It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.

会trigger suffle的操作有以下几类:

  • repartition: repartition and coalesce
  • ByKey operations (except for counting): groupByKey and reduceByKey
  • join operations: cogroup and join

需要注意的是,shuffle是非常expensive的操作,因为它involves disk I/O, data serialization, and network I/O。

Reference

原文地址:https://www.cnblogs.com/maxstack/p/13405622.html