Spark casual notes

Spark comes with several sample programs. Scala, Java, Python and R examples are in the examples/src/main directory. To run one of the Java or Scala sample programs, use bin/run-example <class> [params] in the top-level Spark directory. (Behind the scenes, this invokes the more general spark-submit script for launching applications). For example,

./bin/run-example <class> [params]   (run this command from the top-level Spark directory)

Example: ./bin/run-example SparkPi 10


You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

Spark master URL

./bin/spark-shell --master local[2]

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run Spark shell with the --help option.


local: run locally with a single worker thread
local[K]: run locally with K worker threads (K cores)
local[*]: run locally with as many worker threads as available cores
spark://HOST:PORT: connect to the given Spark standalone cluster master; the port must be specified
mesos://HOST:PORT: connect to the given Mesos cluster; the port must be specified
yarn-client (client mode): connect to a YARN cluster; HADOOP_CONF_DIR must be configured
yarn-cluster (cluster mode): connect to a YARN cluster; HADOOP_CONF_DIR must be configured
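
The same master URL can also be set programmatically when constructing a SparkContext in a standalone application. A minimal sketch (the app name and the local[2] master are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// setMaster accepts any of the URLs listed above,
// e.g. "local[2]", "spark://HOST:7077", or "yarn-client".
val conf = new SparkConf().setAppName("MasterUrlDemo").setMaster("local[2]")
val sc = new SparkContext(conf)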

spark-submit can be used to submit Java, Python, and R applications.

Example: ./bin/spark-submit examples/src/main/python/pi.py 10
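
A Scala or Java application is submitted the same way with --class and a jar; a sketch (the examples jar path is an assumption based on the Spark 1.6.1 pre-built distribution layout):

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[2] lib/spark-examples-1.6.1-hadoop2.6.0.jar 10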

Location of the Spark examples

[root@master mllib]# locate SparkPi
/root/traffic-platform/spark-1.6.1/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java
/root/traffic-platform/spark-1.6.1/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala

Python

>>>textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?

Scala

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?

Example:

sc.textFile("/lwtest/test.txt").filter(lambda line: "Spark" in line).count()

./bin/spark-shell
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:

scala> val textFile = sc.textFile("README.md") // the path is a file on HDFS
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:

scala> textFile.count() // Number of items in this RDD (here, an item is a line)
res0: Long = 126

scala> textFile.first() // First item in this RDD

Example:

scala> val textFile=sc.textFile("/lwtest/test.txt")

scala> textFile.filter(line => line.contains("season")).count()

Example: Let’s say we want to find the line with the most words:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use Math.max() function to make this code easier to understand:


scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:


scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:


scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)

Linux find

http://www.cnblogs.com/peida/archive/2012/11/16/2773289.html

find . -name "*sbt*"

Packaging with sbt and Maven
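
A minimal build.sbt sketch for packaging a Spark application with sbt (the project name, Scala version, and dependency scope are illustrative assumptions for Spark 1.6.1, which is built against Scala 2.10):

name := "simple-spark-app"

version := "1.0"

scalaVersion := "2.10.6"

// spark-core is marked "provided" because the cluster supplies Spark at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"

Running sbt package then produces a jar under target/scala-2.10/ that can be passed to spark-submit.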

Reading files in Scala:

By default, sc.textFile("path") reads from HDFS; you can also prefix the path with hdfs:// to read explicitly from the HDFS filesystem.
To read from the local filesystem, prefix the path with file://, e.g. file:///home/user/spark/README.md
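
For example (both paths are illustrative):

scala> val hdfsFile = sc.textFile("hdfs:///lwtest/test.txt")  // explicit HDFS path (uses the configured default HDFS)
scala> val localFile = sc.textFile("file:///home/user/spark/README.md")  // local filesystem path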

Original post: https://www.cnblogs.com/lwhp/p/5687338.html