Writing a WordCount Program in IDEA and Running It with spark-submit

1. Create a new Maven project in IDEA

Give the project a custom name.

2. Edit the pom file: the properties block pins the Scala, Spark, and Hadoop versions, followed by the dependencies and the build plugins. The spark-core artifact suffix (_2.11) must match the Scala binary version (2.11.x). Note that the *-hw-ei-* versions below are vendor-specific builds from the author's cluster repository; on stock Apache components, substitute the matching Apache versions.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>sparktest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.4.5-hw-ei-302002</spark.version>
        <hadoop.version>3.1.1-hw-ei-302002</hadoop.version>
        <hbase.version>2.2.3-hw-ei-302002</hbase.version>
    </properties>

    <dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Pin the hadoop-client API version -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>

            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <configuration>
                    <recompileMode>modified-only</recompileMode>
                </configuration>
                <executions>
                    <execution>
                        <id>main-scalac</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

        </plugins>
        <directory>target</directory>
        <outputDirectory>target/classes</outputDirectory>
        <testOutputDirectory>target/test-classes</testOutputDirectory>
    </build>


</project>

Create a directory under src/main and name it scala. Its color differs from the java folder: it is not yet a sources root, so classes cannot be added to it.

3. Click reimport (the small circular-arrows icon in the Maven panel); after the reimport, the scala directory becomes a sources root.

4. Create a Scala class and write the program

Create an object and give it a custom name.

5. Write the program below; once it is finished, double-click package in the Maven panel to generate the jar.

package cn.study.spark.scala

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration and set the application name
    val conf = new SparkConf().setAppName("ScalaWordCount")

    // Create the entry point for Spark execution
    val sc = new SparkContext(conf)

    // Specify the file to read, creating an RDD (Resilient Distributed Dataset).
    // One-liner equivalent of the steps below:
    // sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).saveAsTextFile(args(1))
    val lines: RDD[String] = sc.textFile(args(0))
    // Split each line into words and flatten
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // Pair each word with the count 1
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    // Aggregate the counts by key
    val reduced: RDD[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    // Sort by count in descending order
    val sorted: RDD[(String, Int)] = reduced.sortBy(_._2, false)
    // Save the result to HDFS
    sorted.saveAsTextFile(args(1))
    // Release resources
    sc.stop()
  }
}
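
Before packaging, it can help to smoke-test the logic inside IDEA. A minimal sketch, assuming local mode; the object name WordCountLocal and the data/t1.txt path are hypothetical, not part of the original post:

package cn.study.spark.scala

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local runner for quick testing in IDEA; not part of the original post
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process using all available cores
    val conf = new SparkConf().setAppName("ScalaWordCountLocal").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("data/t1.txt") // hypothetical local input path
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    // Print to the IDEA console instead of writing to HDFS
    counts.collect().foreach(println)
    sc.stop()
  }
}

Double-clicking package in the Maven tool window is equivalent to running mvn package on the command line; the jar is generated under target/.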

6. Upload the jar to a cluster node, and upload the text file containing the words to HDFS.
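
For reference, the upload might look like this, assuming the local input file is named t1.txt and using the HDFS paths from the submit command in step 7:

hdfs dfs -mkdir -p /tmp/test
hdfs dfs -put t1.txt /tmp/test/t1.txt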

7. Run the program. The two trailing arguments map to args(0) (the HDFS input file) and args(1) (the output directory); the output directory must not already exist, or saveAsTextFile will fail.

spark-submit --master yarn --deploy-mode client --class cn.study.spark.scala.WordCount sparktest-1.0-SNAPSHOT.jar /tmp/test/t1.txt /tmp/test/t2
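
The same jar can also be submitted in cluster mode, where the driver runs inside YARN instead of on the submitting node; only the --deploy-mode flag changes:

spark-submit --master yarn --deploy-mode cluster --class cn.study.spark.scala.WordCount sparktest-1.0-SNAPSHOT.jar /tmp/test/t1.txt /tmp/test/t2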

8. Check the results
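
saveAsTextFile writes the result as part files inside the output directory; one way to inspect them from the command line (the part-00000 name assumes the default Hadoop file naming):

hdfs dfs -ls /tmp/test/t2
hdfs dfs -cat /tmp/test/t2/part-00000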

Comments and corrections from fellow travelers are welcome ^_^
Original post: https://www.cnblogs.com/cailingsunny/p/14695341.html