【慕课网实战】六、以慕课网日志分析为例进入大数据 Spark SQL 的世界

DataFrame它不是Spark SQL提出的，而是早起在R、Pandas语言就已经有了的。

A Dataset is a distributed collection of data：分布式的数据集

A DataFrame is a Dataset organized into named columns.

以列（列名、列的类型、列值）的形式构成的分布式数据集，按照列赋予不同的名称

student

id:int

name:string

city:string

It is conceptually equivalent to a table in a relational database

or a data frame in R/Python

RDD：

java/scala ==> jvm

python ==> python runtime

DataFrame:

java/scala/python ==> Logic Plan

DataFrame和RDD互操作的两种方式：

1）反射：case class 前提：事先需要知道你的字段、字段类型

2）编程：Row 如果第一种情况不能满足你的要求（事先不知道列）

3) 选型：优先考虑第一种

val rdd = spark.sparkContext.textFile("file:///home/hadoop/data/student.data")

DataFrame = Dataset[Row]

Dataset：强类型 typed case class

DataFrame：弱类型 Row

SQL:

seletc name from person; compile ok, result no

DF:

df.select("name") compile no

df.select("nname") compile ok

DS:

ds.map(line => line.itemid) compile no