spark-sql

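Test data used in this article, person.txt (one record per line: name, then age; the parsing code further down splits the two fields on a tab):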

lijing　　29

guodegang　　45

heyunwei　　30

yueyunpeng　　100

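Number of RDD partitions: when reading files from HDFS, the default follows the input splits (roughly one per HDFS block), so for a set of small files it is approximately the number of files.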

Ways to create an RDD:

1) Parallelizing an existing collection

2) Reading a file through the file-reading API methods
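
The two ways above can be sketched as follows. This is a minimal sketch rather than part of the original scripts: it assumes Spark is on the classpath and runs in local mode, and the object name RddCreationSketch, the helper names, and the temporary file are made up for illustration. It also prints the partition count of each RDD.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name, used only for this illustration.
object RddCreationSketch {
  // Way 1: parallelize an in-memory collection; the second argument is the partition count.
  def fromCollection(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq(("lijing", 29), ("guodegang", 45), ("heyunwei", 30), ("yueyunpeng", 100)), 2)

  // Way 2: read a text file; each line becomes one element of the RDD.
  def fromFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, so no cluster or HDFS is needed.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val fromSeq = fromCollection(sc)
      println(s"parallelize: ${fromSeq.count()} rows in ${fromSeq.partitions.length} partitions")

      // Write a small tab-separated sample to a temporary local file and read it back.
      val tmp = Files.createTempFile("person", ".txt")
      Files.write(tmp, "lijing\t29\nguodegang\t45\n".getBytes(StandardCharsets.UTF_8))
      val fromTxt = fromFile(sc, tmp.toUri.toString)
      println(s"textFile: ${fromTxt.count()} rows in ${fromTxt.partitions.length} partitions")
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell the SparkContext already exists as sc, so only the parallelize and textFile calls are needed there.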

For basic DataFrame operations, and for saving a DataFrame, see the official API documentation.

Ways to create a DataFrame:

1) From an RDD

2) By reading a Hive table

Creating a Hive table:

1) By running Hive statements

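import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext._
// Hive statements must be executed one at a time; passing several statements in one call fails
// sql("use dev;create table person(name string,age int)") // this raises an error
sql("use dev") // select the database
// Note: without a ROW FORMAT clause Hive expects Ctrl-A as the field delimiter; for a tab-separated
// file, append: row format delimited fields terminated by '\t'
sql("create table person(name string,age int)") // create the table
sql("load data local inpath 'person.txt' into table person") // load the data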

2) From a DataFrame

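import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
case class Person(name:String,age:Int)
val rdd_person=sc.textFile("example/person.txt") // this is an HDFS path
// split each line on the tab character and build a Person from the two fields
val person=rdd_person.map(line => line.split("\t")).map(fields => Person(fields(0),fields(1).toInt))

// create the DataFrame
val hive_person = sqlContext.createDataFrame(person)
// the temp table gets its own name so that queries on hive_person below read the Hive table, not the temp table
hive_person.registerTempTable("tmp_person")
sqlContext.sql("use dev")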
/*
Create a managed table.
To partition by specific columns, call partitionBy(colNames: String*).
The mode method specifies what happens when the target already exists:
SaveMode.Overwrite: overwrite the existing data.
SaveMode.Append: append the data.
SaveMode.Ignore: ignore the operation (i.e. no-op).
SaveMode.ErrorIfExists: default option, throw an exception at runtime.
*/

hive_person.write.mode(org.apache.spark.sql.SaveMode.Overwrite).saveAsTable("hive_person")


sqlContext.sql("select * from hive_person limit 1").collect // check that the table was created and contains data

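The script above uses toInt, which throws a NumberFormatException when a value cannot be converted to an Int. If the data may contain such values, define a custom function to handle them: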

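// catch only the conversion failure; a bare "case _" would swallow every Throwable
def parseInt(s: String): Int = try { s.toInt } catch { case _: NumberFormatException => 0 }
parseInt("a") // returns 0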
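
As a variation on the function above, here is a minimal sketch that returns an Option instead of falling back to 0, so malformed records can be dropped rather than stored with age 0. This swaps in scala.util.Try and is not the method used in the original script; the names SafeParse, parseIntOpt and parseLine are hypothetical, and the input is assumed to be tab-separated as in the parsing code above.

```scala
import scala.util.Try

case class Person(name: String, age: Int)

// Hypothetical helper object, used only for this illustration.
object SafeParse {
  // Some(n) for a valid integer, None for anything toInt rejects.
  def parseIntOpt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // Parses one tab-separated line; None when the line does not have
  // exactly two fields or the age is not an integer.
  def parseLine(line: String): Option[Person] =
    line.split("\t") match {
      case Array(name, age) => parseIntOpt(age).map(a => Person(name, a))
      case _                => None
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("lijing\t29", "guodegang\tabc", "broken line")
    // Keep only the lines that parse cleanly.
    val people = lines.flatMap(line => parseLine(line))
    println(people) // prints List(Person(lijing,29))
  }
}
```

On an RDD the same idea would be rdd_person.flatMap(line => parseLine(line)), which silently skips the bad lines; whether skipping or defaulting to 0 is preferable depends on the data.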

Environment setup

Setting up IntelliJ IDEA for Scala

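When building the final jar in IDEA, libraries that already exist in the target environment should not be bundled again, since that makes the jar unnecessarily large. To exclude them, add the following to the corresponding dependencies in pom.xml: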

<scope>provided</scope>

Alternatively, open File > Project Structure and remove the unneeded entries under Output Layout.

Submitting the jar to Spark

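spark-submit --class <class name> --jar <jar file> <arguments>

If the main class takes arguments, "--jar" must be dropped, otherwise an error is reported. Pass the jar path directly, followed by the arguments: spark-submit --class <class name> <jar file> <arguments>. This matches the documented usage of spark-submit, where the application jar is a positional argument; "--jars" is a separate option for additional dependency jars.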
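
Arguments placed after the jar path are passed to the main method of the class named by --class. The following is a minimal sketch of such an entry point; the object name SubmitArgsSketch, the JobArgs fields and the two expected arguments are assumptions made for illustration, and the Spark work itself is left out so that the argument handling stands alone.

```scala
// Hypothetical entry point, used only for this illustration.
object SubmitArgsSketch {
  // Assumed arguments: an input path and a target table name.
  final case class JobArgs(inputPath: String, tableName: String)

  // Returns the parsed arguments, or a usage message when the count is wrong.
  def parseArgs(args: Array[String]): Either[String, JobArgs] = args match {
    case Array(input, table) => Right(JobArgs(input, table))
    case _                   => Left("usage: SubmitArgsSketch <inputPath> <tableName>")
  }

  def main(args: Array[String]): Unit = parseArgs(args) match {
    case Right(job) =>
      // A real job would create its SparkContext here and use job.inputPath / job.tableName.
      println(s"would load ${job.inputPath} into ${job.tableName}")
    case Left(usage) =>
      System.err.println(usage)
      sys.exit(1)
  }
}
```

Under these assumptions the call would look like: spark-submit --class SubmitArgsSketch <jar file> example/person.txt hive_person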

References:

http://www.cnblogs.com/shishanyuan/p/4699644.html

http://lxw1234.com/archives/category/spark

https://taoistwar.gitbooks.io/spark-developer-guide/content/