Offline E-commerce Data Warehouse (66): Data Quality Monitoring (2), Griffin (3), Installation and Usage (1)

2.1 Environment Preparation Before Installation
2.1.1 Install ES 5.2
1) Upload elasticsearch-5.2.2.tar.gz to the /opt/software directory on hadoop102 and extract it to /opt/module
[atguigu@hadoop102 software]$ tar -zxvf elasticsearch-5.2.2.tar.gz -C /opt/module/
2) Edit the /opt/module/elasticsearch-5.2.2/config/elasticsearch.yml configuration file
[atguigu@hadoop102 config]$ vim elasticsearch.yml 
network.host: hadoop102
http.port: 9200
http.cors.enabled: true
http.cors.allow-origin: "*"
bootstrap.memory_lock: false
bootstrap.system_call_filter: false
3) Modify the Linux system configuration file /etc/security/limits.conf
[atguigu@hadoop102 elasticsearch-5.2.2]$ sudo vim /etc/security/limits.conf
# Add the following content
* soft nproc 65536 
* hard nproc 65536 
* soft nofile 65536 
* hard nofile 65536
[atguigu@hadoop102 elasticsearch-5.2.2]$ sudo vim /etc/sysctl.conf

# Add
vm.max_map_count=655360
[atguigu@hadoop102 elasticsearch-5.2.2]$ sudo vim /etc/security/limits.d/90-np

# Modify this setting
* soft nproc 2048
[atguigu@hadoop102 elasticsearch-5.2.2]$ sudo sysctl -p
4) Reboot the virtual machine
[atguigu@hadoop102 elasticsearch-5.2.2]$ su root 
[root@hadoop102 elasticsearch-5.2.2]# reboot
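After logging back in, you can optionally confirm that the limits took effect; this quick check is not part of the original steps:
[atguigu@hadoop102 ~]$ ulimit -n                      # open-file limit, should now be 65536
[atguigu@hadoop102 ~]$ ulimit -u                      # max user processes, should reflect the nproc settings above
[atguigu@hadoop102 ~]$ sudo sysctl vm.max_map_count   # should print vm.max_map_count = 655360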
5) From the /opt/module/elasticsearch-5.2.2 directory, start ES
[atguigu@hadoop102 elasticsearch-5.2.2]$ nohup /opt/module/elasticsearch-5.2.2/bin/elasticsearch &
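Before creating the index, you can optionally confirm that ES is reachable (a simple sanity check, assuming the node started cleanly):
[atguigu@hadoop102 ~]$ curl http://hadoop102:9200/
# should return a JSON body with cluster_name, version, and other node information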
6) Create the griffin index in ES
[atguigu@hadoop102 ~]$ curl -XPUT http://hadoop102:9200/griffin -d '
{
  "aliases": {},
  "mappings": {
    "accuracy": {
      "properties": {
        "name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "tmst": {
          "type": "date"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "2",
      "number_of_shards": "5"
    }
  }
}'
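As a sanity check, you can query the index back to confirm the mapping and settings were applied:
[atguigu@hadoop102 ~]$ curl -XGET http://hadoop102:9200/griffin
# should return the accuracy mapping and the index settings defined above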
2.1.2 Start the HDFS & YARN Services
[atguigu@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh 
[atguigu@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh
2.1.3 Modify the Hive Configuration
Note: the Hive version must be at least 2.2.
1) Copy MySQL's mysql-connector-java-5.1.27-bin.jar to /opt/module/hive/lib/
[atguigu@hadoop102 module]$ cp /opt/software/mysql-libs/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/
2) In /opt/module/hive/conf, edit the hive-site.xml file and add the content below. (Make sure the MySQL password is correct, otherwise the metastore connection will fail.)
[atguigu@hadoop102 conf]$ vim hive-site.xml 
# Add the following content
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop102:9083</value>
    </property>
</configuration>
3) Start the services
[atguigu@hadoop102 hive]$ nohup /opt/module/hive/bin/hive --service metastore & 
[atguigu@hadoop102 hive]$ nohup /opt/module/hive/bin/hive --service hiveserver2 &
Note: Hive 2.x requires starting both the metastore and hiveserver2 services; otherwise you will get an error like:
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
4) After the services are up, start Hive
[atguigu@hadoop102 hive]$ /opt/module/hive/bin/hive
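As a quick smoke test (not part of the original steps), run a simple query in the Hive CLI to confirm the connection to the metastore works:
hive (default)> show databases;
hive (default)> show tables;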
2.1.4 Install Spark 2.4.6
Note: the Spark version must be at least 2.2.1.
1) Upload spark-2.4.6-bin-hadoop2.7.tgz to the /opt/software directory and extract it to /opt/module
[atguigu@hadoop102 software]$ tar -zxvf spark-2.4.6-bin-hadoop2.7.tgz -C /opt/module/
2) Rename /opt/module/spark-2.4.6-bin-hadoop2.7 to spark
[atguigu@hadoop102 module]$ mv spark-2.4.6-bin-hadoop2.7/ spark
3) Rename /opt/module/spark/conf/spark-defaults.conf.template to spark-defaults.conf
[atguigu@hadoop102 conf]$ mv spark-defaults.conf.template spark-defaults.conf
4) Configure the Spark event log path in the spark-defaults.conf file
[atguigu@hadoop102 conf]$ vim spark-defaults.conf 
# Add the following configuration
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs://hadoop102:9000/spark_directory
5) Rename the slaves.template configuration file to slaves
[atguigu@hadoop102 conf]$ mv slaves.template slaves
6) Edit the slaves file and add the worker nodes:
[atguigu@hadoop102 conf]$ vim slaves
hadoop102 
hadoop103 
hadoop104
7) Rename /opt/module/spark/conf/spark-env.sh.template to spark-env.sh
[atguigu@hadoop102 conf]$ mv spark-env.sh.template spark-env.sh
8) In /opt/module/spark/conf/spark-env.sh, set the YARN configuration directory and the history server parameters
[atguigu@hadoop102 conf]$ vim spark-env.sh 
# Add the following parameters
YARN_CONF_DIR=/opt/module/hadoop-2.7.2/etc/hadoop
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=30
-Dspark.history.fs.logDirectory=hdfs://hadoop102:9000/spark_directory"
SPARK_MASTER_HOST=hadoop102 
SPARK_MASTER_PORT=7077
9) Create the spark_directory log path on the Hadoop cluster in advance
[atguigu@hadoop102 spark]$ hadoop fs -mkdir /spark_directory
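With the event log directory in place, you can optionally start the Spark history server and open http://hadoop102:18080 to confirm it is running; this check is a suggestion rather than part of the original walkthrough:
[atguigu@hadoop102 spark]$ sbin/start-history-server.sh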
10) Copy Hive's /opt/module/hive/lib/datanucleus-*.jar files to Spark's /opt/module/spark/jars directory
[atguigu@hadoop102 lib]$ cp /opt/module/hive/lib/datanucleus-*.jar /opt/module/spark/jars/
11) Copy Hive's /opt/module/hive/conf/hive-site.xml file to Spark's /opt/module/spark/conf directory
[atguigu@hadoop102 conf]$ cp /opt/module/hive/conf/hive-site.xml /opt/module/spark/conf/
12) Modify the Hadoop configuration file yarn-site.xml and add the following:
[atguigu@hadoop102 hadoop]$ vi yarn-site.xml
<!-- Whether to start a thread that checks the physical memory each task uses and kills the task if it exceeds its allocation; the default is true --> 
<property> 
    <name>yarn.nodemanager.pmem-check-enabled</name>     
    <value>false</value> 
</property> 
<!-- Whether to start a thread that checks the virtual memory each task uses and kills the task if it exceeds its allocation; the default is true --> 
<property> 
    <name>yarn.nodemanager.vmem-check-enabled</name> 
    <value>false</value> 
</property>    
13) Distribute spark & yarn-site.xml
[atguigu@hadoop102 conf]$ xsync /opt/module/hadoop-2.7.2/etc/hadoop/yarn-site.xml 
[atguigu@hadoop102 conf]$ xsync /opt/module/spark
14) Test the environment
[atguigu@hadoop102 spark]$ bin/spark-shell 
scala>spark.sql("show databases").show
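To also confirm that Spark can submit jobs to YARN, a SparkPi run is a convenient extra check (the examples jar name assumes the stock spark-2.4.6-bin-hadoop2.7 distribution, which is built against Scala 2.11):
[atguigu@hadoop102 spark]$ bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  examples/jars/spark-examples_2.11-2.4.6.jar 10
# the driver output should contain a line like "Pi is roughly 3.14..."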
2.1.5 Install Livy 0.3
1) Upload livy-server-0.3.0.zip to the /opt/software directory on hadoop102 and unzip it to /opt/module
[atguigu@hadoop102 software]$ unzip livy-server-0.3.0.zip -d /opt/module/
2) Rename /opt/module/livy-server-0.3.0 to livy
[atguigu@hadoop102 module]$ mv livy-server-0.3.0/ livy
3) Edit the /opt/module/livy/conf/livy-env.sh file and add the Livy environment variables
 
export HADOOP_CONF_DIR=/opt/module/hadoop-2.7.2/etc/hadoop/ 
export SPARK_HOME=/opt/module/spark/
4) Edit the /opt/module/livy/conf/livy.conf file and configure the Livy and Spark related parameters
livy.server.host = hadoop102 
livy.spark.master =yarn 
livy.spark.deployMode = client 
livy.repl.enableHiveContext = true 
livy.server.port = 8998
5) Configure the required environment variables
[atguigu@hadoop102 conf]$ sudo vim /etc/profile 
#SPARK_HOME 
export SPARK_HOME=/opt/module/spark 
export PATH=$PATH:$SPARK_HOME/bin
[atguigu@hadoop102 conf]$ source /etc/profile
6) From the /opt/module/livy/ directory, start the Livy service
 
[atguigu@hadoop102 livy]$ bin/livy-server start
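Livy serves its REST API on port 8998 (configured above), so a quick request can confirm the service is up; this is only a sanity check:
[atguigu@hadoop102 ~]$ curl http://hadoop102:8998/sessions
# a fresh server should return something like {"from":0,"total":0,"sessions":[]}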
2.1.6 Initialize the MySQL Database
1) Upload Init_quartz_mysql_innodb.sql to the /opt/software directory on hadoop102
2) In MySQL, create the quartz database and run the script to initialize the tables
[atguigu@hadoop102 ~]$ mysql -uroot -p123456 
mysql> create database quartz; 
mysql> use quartz; 
mysql> source /opt/software/Init_quartz_mysql_innodb.sql 
mysql> show tables;
2.2 Build Griffin (optional, not the route chosen here)
2.2.1 Install Maven
1) Download Maven: https://maven.apache.org/download.cgi
2) Upload apache-maven-3.6.1-bin.tar.gz to the /opt/software directory on Linux
3) Extract apache-maven-3.6.1-bin.tar.gz to the /opt/module/ directory
[atguigu@hadoop102 software]$ tar -zxvf apache-maven-3.6.1-bin.tar.gz -C /opt/module/
4) Rename apache-maven-3.6.1 to maven
[atguigu@hadoop102 module]$ mv apache-maven-3.6.1/ maven
5) Add the environment variables to /etc/profile
[atguigu@hadoop102 module]$ sudo vim /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven 
export PATH=$PATH:$MAVEN_HOME/bin
6) Verify the installation
[atguigu@hadoop102 module]$ source /etc/profile 
[atguigu@hadoop102 module]$ mvn -v
7) Edit settings.xml and point the mirror at Aliyun
[atguigu@hadoop102 maven]$ cd conf
[atguigu@hadoop102 conf]$ vim settings.xml
<!-- Add the Aliyun mirror -->
<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
<mirror>
    <id>UK</id>
    <name>UK Central</name>
    <url>http://uk.maven.org/maven2</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>repo1</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://repo1.maven.org/maven2/</url>
</mirror>
<mirror>
    <id>repo2</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://repo2.maven.org/maven2/</url>
</mirror>
8) Create a .m2 folder under the /home/atguigu directory
[atguigu@hadoop102 ~]$ mkdir .m2
2.2.2 Modify the Configuration Files
1) Upload griffin-master.zip to the /opt/software directory on hadoop102 and unzip it to /opt/module
[atguigu@hadoop102 software]$ unzip griffin-master.zip -d /opt/module/
2) Edit the /opt/module/griffin-master/ui/pom.xml file and add the node and npm download sources.
[atguigu@hadoop102 ui]$ vim pom.xml 
<!-- It will install nodejs and npm --> 
<execution>
    <id>install node and npm</id>
    <goals>
        <goal>install-node-and-npm</goal>
    </goals>
    <configuration>
        <nodeVersion>${node.version}</nodeVersion>
        <npmVersion>${npm.version}</npmVersion>
        <nodeDownloadRoot>http://nodejs.org/dist/</nodeDownloadRoot>
        <npmDownloadRoot>http://registry.npmjs.org/npm/-/</npmDownloadRoot>
    </configuration>
</execution>
3) Edit the /opt/module/griffin-master/service/pom.xml file, comment out the org.postgresql dependency, and add the mysql dependency.
[atguigu@hadoop102 service]$ vim pom.xml 
<!-- 
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>${postgresql.version}</version>
</dependency>
--> 
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
</dependency>
Note: remove the version number.
4) Edit the /opt/module/griffin-master/service/src/main/resources/application.properties file
[atguigu@hadoop102 service]$ vim /opt/module/griffin-master/service/src/main/resources/application.properties 
# Apache Griffin application name
spring.application.name=griffin_service 
# MySQL database configuration
spring.datasource.url=jdbc:mysql://hadoop102:3306/quartz?autoReconnect=true&useSSL=false 
spring.datasource.username=root
spring.datasource.password=123456 
spring.jpa.generate-ddl=true 
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true 
# Hive metastore configuration
hive.metastore.uris=thrift://hadoop102:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms 
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000 
# Kafka schema registry (configure as needed)
kafka.schema.registry.url=http://hadoop102:8081 
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000 
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000 
# schedule predicate job every 5 minutes and repeat 12 times at most 
# interval time unit s:second m:minute h:hour d:day,only support these four units 
predicate.job.interval=5m 
predicate.job.repeat.count=12 
# external properties directory location 
external.config.location= 
# external BATCH or STREAMING env
external.env.location= 
# login strategy ("default" or "ldap")
login.strategy=default 
# ldap 
ldap.url=ldap://hostname:port 
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0}) 
# hdfs default name 
fs.defaultFS= 
# elasticsearch 
elasticsearch.host=hadoop102 
elasticsearch.port=9200 
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://hadoop102:8998/batches
# yarn url
yarn.uri=http://hadoop103:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
 
5) Edit the /opt/module/griffin-master/service/src/main/resources/sparkProperties.json file
[atguigu@hadoop102 service]$ vim /opt/module/griffin-master/service/src/main/resources/sparkProperties.json
{
  "file": "hdfs://hadoop102:9000/griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "name": "griffin",
  "queue": "default",
  "numExecutors": 2,
  "executorCores": 1,
  "driverMemory": "1g",
  "executorMemory": "1g",
  "conf": {
    "spark.yarn.dist.files": "hdfs://hadoop102:9000/home/spark_conf/hive-site.xml"
  },
  "files": []
}
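Note that this file references HDFS paths that must exist before measure jobs can run. Assuming the paths are kept as above, they can be prepared roughly as follows; the griffin-measure.jar itself comes from the build in section 2.2 (or from a pre-built package), so upload it once you have it:
[atguigu@hadoop102 ~]$ hadoop fs -mkdir -p /griffin /home/spark_conf
[atguigu@hadoop102 ~]$ hadoop fs -put /opt/module/hive/conf/hive-site.xml /home/spark_conf/
[atguigu@hadoop102 ~]$ hadoop fs -put griffin-measure.jar /griffin/   # local path of the jar depends on where it was built or unpacked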
6) Edit the /opt/module/griffin-master/service/src/main/resources/env/env_batch.json file
[atguigu@hadoop102 service]$ vim /opt/module/griffin-master/service/src/main/resources/env/env_batch.json 
{
  "spark": {
    "log.level": "INFO"
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": { "max.log.lines": 10 }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://hadoop102:9000/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://hadoop102:9200/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}
7) Edit the /opt/module/griffin-master/service/src/main/resources/env/env_streaming.json file
[atguigu@hadoop102 service]$ vim /opt/module/griffin-master/service/src/main/resources/env/env_streaming.json
{
  "spark": {
    "log.level": "WARN",
    "checkpoint.dir": "hdfs:///griffin/checkpoint/${JOB_NAME}",
    "init.clear": true,
    "batch.interval": "1m",
    "process.interval": "5m",
    "config": {
      "spark.default.parallelism": 4,
      "spark.task.maxFailures": 5,
      "spark.streaming.kafkaMaxRatePerPartition": 1000,
      "spark.streaming.concurrentJobs": 4,
      "spark.yarn.maxAppAttempts": 5,
      "spark.yarn.am.attemptFailuresValidityInterval": "1h",
      "spark.yarn.max.executor.failures": 120,
      "spark.yarn.executor.failuresValidityInterval": "1h",
      "spark.hadoop.fs.hdfs.impl.disable.cache": true
    }
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": { "max.log.lines": 100 }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://hadoop102:9000/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://hadoop102:9200/griffin/accuracy"
      }
    }
  ],
  "griffin.checkpoint": [
    {
      "type": "zk",
      "config": {
        "hosts": "zk:2181",
        "namespace": "griffin/infocache",
        "lock.path": "lock",
        "mode": "persist",
        "init.clear": true,
        "close.clear": false
      }
    }
  ]
}
8) Edit the /opt/module/griffin-master/service/src/main/resources/quartz.properties file
[atguigu@hadoop102 service]$ vim /opt/module/griffin-master/service/src/main/resources/quartz.properties 
org.quartz.scheduler.instanceName=spring-boot-quartz 
org.quartz.scheduler.instanceId=AUTO 
org.quartz.threadPool.threadCount=5 
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX 
# If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate 
# If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate 
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others 
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate 
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
2.2.3 Run the Build
1) In the /opt/module/griffin-master directory, run the Maven command to begin compiling the Griffin source code
[atguigu@hadoop102 griffin-master]$ mvn -Dmaven.test.skip=true clean install
2) When the build completes, Maven reports BUILD SUCCESS (screenshots omitted); the build takes roughly an hour.
This post is from cnblogs (博客园), by 秋华. Please cite the original link when reposting: https://www.cnblogs.com/qiu-hua/p/13747505.html