Spark

初识 Spark 大数据处理，目前还只是小白阶段，初步搭建起运行环境，慢慢学习之。

本文熟悉下 Spark 数据处理的几个经典案例。

首先将 Scala SDK 的源码导入 IDEA，方便查看和调试代码，具体参考：intellij idea查看scala sdk的源代码

WordCount

WordCount 号称大数据界的 HelloWorld，初识大数据代码，从 WordCount 开始，其基本流程图如下：

相关代码如下：

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
    def main(args: Array[String]) {
        if (args.length < 1) {
          System.err.println("Usage: <file>")
          System.exit(1)
        }

        // 创建 SparkConf
        val conf = new SparkConf()
        conf.setAppName("WordCount")
            .setMaster("local")

        // 创建 SparkContext
        val sc = new SparkContext(conf)

        // 数据处理
        val line = sc.textFile(args(0))
        line.flatMap(_.split("\s+"))
            .map((_, 1))
            .reduceByKey(_+_)
            .collect.foreach(println)

        // 关闭 SparkContext
        sc.stop
    }
}

注意几个问题：

正则表达式 "\s+" 匹配任意空白字符
SparkConf Name 和 Master Level 必须设置，本地调试应 local 或 local[i]，i 表示线程数（worker threads）
args(0) 表示待测试的文件，eg，"sqh.txt"
无论本地测试还是集群测试必须有 SparkContext 的实例

其中，textFile() 方法用于从文件创建 RDD，RDD 的每个元素对应文件中的每一行。源码定义如下：

def textFile(path : scala.Predef.String, minPartitions : scala.Int = { /* compiled code */ }) 
	: org.apache.spark.rdd.RDD[scala.Predef.String] = { /* compiled code */ }

词频统计示意图

其中，假定分片M=5，分区R=3，有6台机器，一台master，5台slaver。

参考：